The ROC curve, and the area under the curve (AUC) statistic derived from it, are important tools for optimising classifiers.
As I understand it, the ROC is derived by varying the decision line, recording the true positive and false positive fractions as the line is adjusted, and plotting a curve from the resulting data. I have often heard the AUC claimed to measure the probability of correctly classifying a random sample.
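To make that construction concrete, here is a minimal sketch of how I picture the curve being built, using made-up scores and labels and a hand-rolled threshold sweep (all values purely illustrative):

```python
import numpy as np

# Made-up continuous scores and true labels for a calibration set
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95])
labels = np.array([0,   0,   0,    1,   0,    1,   1,   1,   0,   1])

# Sweep the decision line over every observed score (highest first) and
# record the true positive and false positive fractions at each position
thresholds = np.unique(scores)[::-1]
tpr = np.array([0.0] + [(scores >= t)[labels == 1].mean() for t in thresholds])
fpr = np.array([0.0] + [(scores >= t)[labels == 0].mean() for t in thresholds])

# The ROC curve is the plot of tpr against fpr; the AUC is the area under it
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal rule
print(auc)  # 0.8 for this made-up data
```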
Surely it is in fact measuring the probability of correctly classifying a sample when the decision line is also chosen independently and at random, i.e. two simultaneous uncertainties. Once you have finalised your test protocol based on the calibration/training set, you no longer choose an arbitrary decision line; you use the one selected in the calibration phase. To me this means you can no longer calculate the ROC, because you can no longer vary the decision line, which would make the ROC/AUC fundamentally unsuitable for validation/test sets.
In Hanley and McNeil (Radiology, 1982, vol. 143, pp. 29-36) it is defined subtly but significantly differently, and more accurately: the AUC is the probability of correctly ranking a pair of one true positive and one true negative in their true order (not necessarily classifying both correctly). However, since we ultimately want classification models to make a hard decision on which class a sample belongs to, surely the relative rank of a pair is not what matters in a test set, just the probability of each classification being correct?
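As far as I can tell, that ranking interpretation is exactly what the AUC computes. A minimal sketch with the same hypothetical scores as above, comparing the fraction of correctly ordered positive/negative pairs with sklearn's roc_auc_score (both give the same number):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores and labels, purely illustrative
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95])
labels = np.array([0,   0,   0,    1,   0,    1,   1,   1,   0,   1])

pos = scores[labels == 1]
neg = scores[labels == 0]

# Probability that a random positive outranks a random negative
# (ties counted as half): the Hanley & McNeil interpretation
diffs = pos[:, None] - neg[None, :]
rank_prob = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size

print(rank_prob, roc_auc_score(labels, scores))  # both 0.8 here
```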
I would be inclined to use measures such as odds ratios, from which significance can be calculated in both calibration and validation sets.
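For instance, with the decision line held fixed at whatever was chosen during calibration, the 2x2 table on the validation set gives an odds ratio and a p-value directly. A sketch using scipy's fisher_exact and invented numbers (the threshold and data are only placeholders):

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical validation-set scores and labels, with the decision line
# fixed during calibration (all values invented for illustration)
scores = np.array([0.2, 0.45, 0.5, 0.55, 0.65, 0.7, 0.75, 0.85, 0.9, 0.95])
labels = np.array([0,   0,    1,   0,    1,    0,   1,    1,    1,   1])
threshold = 0.6  # fixed in the calibration phase, not varied here

pred = scores >= threshold
tp = np.sum(pred & (labels == 1))
fp = np.sum(pred & (labels == 0))
fn = np.sum(~pred & (labels == 1))
tn = np.sum(~pred & (labels == 0))

# Odds ratio and Fisher's exact p-value from the 2x2 table at this threshold
odds_ratio, p_value = fisher_exact([[tp, fp], [fn, tn]])
print(odds_ratio, p_value)
```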
I would be interested if anyone has insights or can recommend some useful background reading on the issue.