I am training a deep learning model for an image classification task. A VGG-16 model is trained separately on two training sets with different degrees of class imbalance. In Set-1, the positive class is only 25% the size of the negative class (class-1, positive, label 1: 100 samples; class-0, negative, label 0: 400 samples). In Set-2, the positive class is 80% the size of the negative class (class-1: 320; class-0: 400), so Set-1 is much more imbalanced than Set-2. The two trained models were evaluated on a common test set with equal numbers of positive and negative samples (n = 100 each). Since both Set-1 and Set-2 are imbalanced, the output probabilities are not expected to be calibrated, so I applied temperature scaling to rescale them.

Figures (a) and (b) show the results before and after calibration for the model trained on Set-1, and (c) and (d) for the model trained on Set-2. I observe that the expected calibration error and maximum calibration error decrease after temperature scaling, and the curves move closer to the y = x diagonal. However, it is not clear to me how to interpret the pre- and post-calibration curves in terms of data imbalance. How can the fraction of positives in each confidence bin, before and after calibration, be explained in terms of the data imbalance?
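For reference, here is a minimal PyTorch sketch of what I mean by temperature scaling and by the fraction of positives per confidence bin. It assumes the model outputs two-class logits and that `val_logits`/`val_labels` come from a held-out set; all names are illustrative, not my exact code.

```python
import torch
import torch.nn as nn

def fit_temperature(logits, labels, max_iter=50):
    # Fit a single scalar temperature T on held-out logits by minimizing NLL.
    # logits: (N, 2) raw model outputs; labels: (N,) long tensor in {0, 1}.
    temperature = nn.Parameter(torch.ones(1))
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / temperature, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.detach().item()

def positive_fraction_per_bin(logits, labels, T=1.0, n_bins=10):
    # Bin the predicted P(class = 1) and compare the mean confidence in each
    # bin with the empirical fraction of label == 1 samples in that bin
    # (the quantity plotted against the y = x diagonal).
    p_pos = torch.softmax(logits / T, dim=1)[:, 1]
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p_pos > lo) & (p_pos <= hi)
        if in_bin.any():
            rows.append((in_bin.float().mean().item(),          # bin weight
                         p_pos[in_bin].mean().item(),            # mean confidence
                         labels[in_bin].float().mean().item()))  # fraction of positives
    return rows

# Usage sketch: fit T on a validation split, then inspect the test-set bins.
# T = fit_temperature(val_logits, val_labels)
# bins = positive_fraction_per_bin(test_logits, test_labels, T=T)
# ece = sum(w * abs(conf - frac) for w, conf, frac in bins)
```

The expected calibration error is then the bin-weight-averaged gap between mean confidence and the fraction of positives, and the maximum calibration error is the largest such gap over the bins.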