Do I need to calculate the above metrics for every available trained weight file on the complete test set in order to evaluate model performance? Or do I only need to calculate them for the best weight file, the one I chose as my final model, on the complete test set? I ask because, in the reference below, the author of the repository calculates both metrics for every weight file on the complete test set.
Reference: https://github.com/olgaliak/segmentation-unet-maskrcnn/blob/master/maskRCNN/main_eval.py
Please refer to line 152.