I trained a VGG-16 model on a highly imbalanced dataset in which the positive samples (class 1) were only 20% of the negative samples (class 0): 100 positive samples versus 500 negative samples. The trained model was evaluated on a test set with an equal number of positive and negative samples (n = 100 each). I calibrated the model outputs using temperature scaling so that the predicted probabilities better reflect the true distribution of positive samples. The table below shows the performance of the baseline (uncalibrated) model and the performance obtained after calibration with temperature scaling.
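For context, this is a minimal sketch of the temperature-scaling step I am describing, not my exact code: a single scalar temperature is fitted on held-out validation logits by minimizing the negative log-likelihood. The arrays `val_logits` and `val_labels` are hypothetical placeholders for the validation data.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # sigmoid

rng = np.random.default_rng(0)
val_logits = rng.normal(size=200)           # placeholder: validation logits from the binary head
val_labels = rng.integers(0, 2, size=200)   # placeholder: 0/1 validation labels

def nll(temperature):
    # Temperature-scaled probabilities for the positive class
    p = expit(val_logits / temperature)
    eps = 1e-12
    return -np.mean(val_labels * np.log(p + eps) + (1 - val_labels) * np.log(1 - p + eps))

# Fit the temperature by minimizing the negative log-likelihood on the validation set
res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
T_fitted = res.x
calibrated_probs = expit(val_logits / T_fitted)
```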
I observed that the baseline model at its default threshold (T = 0.5) performed poorly. After temperature scaling of the baseline model's outputs, however, performance improved greatly on all metrics: the log loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) obtained with the recalibrated probabilities were the lowest. I then identified the optimal operating threshold (T = 0.18) for the baseline model using the geometric mean of sensitivity and specificity (G-mean), to obtain the best trade-off between the two, and the baseline model's performance improved greatly here as well. After recalibrating the probabilities with temperature scaling, I identified the optimal threshold (T = 0.48) using the same G-mean criterion, as sketched below. I am surprised to see that the performance obtained with the optimal threshold (T = 0.18) on the baseline model's uncalibrated output probabilities and the performance obtained with the optimal threshold (T = 0.48) on the temperature-scaled, recalibrated output probabilities were exactly the same. How could I interpret this behavior?
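This is roughly how I select the operating threshold via the G-mean, shown here as a sketch using scikit-learn's ROC utilities; `y_true` and `y_prob` stand in for the test labels and the (uncalibrated or recalibrated) predicted probabilities, and are not my actual variable names.

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_gmean_threshold(y_true, y_prob):
    """Return the threshold that maximizes sqrt(sensitivity * specificity)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    gmeans = np.sqrt(tpr * (1 - fpr))   # geometric mean of sensitivity and specificity
    best = np.argmax(gmeans)
    return thresholds[best], gmeans[best]

# Example usage (placeholder data):
# t_base, g_base = best_gmean_threshold(y_test, baseline_probs)
# t_cal,  g_cal  = best_gmean_threshold(y_test, calibrated_probs)
```

I apply this search once to the baseline probabilities (giving T = 0.18) and once to the temperature-scaled probabilities (giving T = 0.48), and then compare the resulting classification metrics at those thresholds.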