I am testing my algorithm on 20 different datasets. For each run (one per dataset) I computed an F1 score from precision and recall, so I have 20 F1 scores in total. Is there a way to combine these 20 F1 scores into a single metric or summary statistic that reports how the algorithm performed across all 20 datasets?
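
To make the question concrete, here is a rough sketch of two candidate summaries I could compute. It assumes I still have the raw true-positive, false-positive, and false-negative counts for each dataset, which may not hold in every setup. The function names `f1_score` and `summarize_f1` and the example counts are made up for illustration. The macro average treats every dataset equally, while the micro average pools the counts, so larger datasets weigh more.

```python
from statistics import mean, stdev


def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize_f1(counts):
    """Summarize F1 over datasets.

    counts: list of (tp, fp, fn) tuples, one per dataset (hypothetical input).
    """
    per_dataset = [f1_score(tp, fp, fn) for tp, fp, fn in counts]

    # Macro average: plain mean of the per-dataset F1 scores.
    macro = mean(per_dataset)
    # Sample standard deviation, to report the spread alongside the mean.
    spread = stdev(per_dataset) if len(per_dataset) > 1 else 0.0

    # Micro average: pool the counts first, then compute a single F1.
    tp_sum = sum(tp for tp, _, _ in counts)
    fp_sum = sum(fp for _, fp, _ in counts)
    fn_sum = sum(fn for _, _, fn in counts)
    micro = f1_score(tp_sum, fp_sum, fn_sum)

    return {"macro_f1": macro, "std": spread, "micro_f1": micro}


if __name__ == "__main__":
    # Two illustrative datasets only; the real case would have 20 entries.
    example_counts = [(8, 2, 2), (1, 1, 3)]
    print(summarize_f1(example_counts))
```

If only the 20 F1 values are available and not the counts, just the macro part (mean and standard deviation) applies.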