I have a question regarding the CIs of the AUROC calculated by merging/pooling the predictions from different test sets.
In one analysis, we used a sort of nested cross-validation approach, although the outer loop is more properly a test loop than a validation loop. The dataset was split into 5 folds, and the whole analysis was replicated five times, each time using four folds as the training set and the remaining fold as the test set. The same technique was used in each of the five repetitions, with the hyperparameters optimized through an inner cross-validation loop (which therefore varies across the five repetitions). The algorithm was then used to generate continuous predictions for the cases in the test set. These test predictions were not used to make any decisions or comparisons. The predictions from the five test sets were then pooled, and the AUROC was calculated on the total pooled sample.
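In case it helps, here is a minimal sketch of the procedure in Python, assuming scikit-learn-style estimators; the model, hyperparameter grid, and data are placeholders, not the ones we actually used.

```python
# Sketch of the pooling procedure described above (placeholder model and data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pooled_scores = np.empty_like(y, dtype=float)

for train_idx, test_idx in outer_cv.split(X, y):
    # Inner CV on the four training folds to tune the hyperparameters
    inner_model = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10]},
        scoring="roc_auc",
        cv=5,
    )
    inner_model.fit(X[train_idx], y[train_idx])
    # Continuous predictions for the held-out test fold, stored in place
    pooled_scores[test_idx] = inner_model.predict_proba(X[test_idx])[:, 1]

# Single AUROC on the pooled out-of-fold test predictions
pooled_auc = roc_auc_score(y, pooled_scores)
```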
Given the procedure we used, is it correct to calculate the CIs of the pooled AUROC either with the usual asymptotic approach or via a stratified bootstrap that resamples directly from the pooled test predictions?
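To be concrete about the second option, this is roughly what I have in mind (a sketch that reuses `y` and `pooled_scores` from the snippet above):

```python
# Option two: stratified bootstrap of the pooled test predictions themselves
# (reuses y and pooled_scores from the sketch above).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

boot_aucs = []
for _ in range(2000):
    # Resample positives and negatives separately so the prevalence stays fixed
    idx = np.concatenate([
        rng.choice(pos, size=pos.size, replace=True),
        rng.choice(neg, size=neg.size, replace=True),
    ])
    boot_aucs.append(roc_auc_score(y[idx], pooled_scores[idx]))

ci_lower, ci_upper = np.percentile(boot_aucs, [2.5, 97.5])
```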
Or is the only correct way to calculate the CIs instead to bootstrap the training sets n times, retrain the algorithm on each bootstrap sample, generate predictions on the test sets with the retrained algorithms, and recalculate the AUROC each time? The CIs would then come from the distribution of AUROCs obtained by repeating this procedure n times.
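Again, just to be explicit about what I mean by this third option, a sketch (reusing `X` and `y` from the first snippet; the helper function name is my own):

```python
# Option three: bootstrap the training folds, retrain, and rebuild the
# pooled AUROC each time (reuses X and y from the first sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def pooled_auc_with_bootstrapped_training(X, y, rng):
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = np.empty_like(y, dtype=float)
    for train_idx, test_idx in outer_cv.split(X, y):
        # Resample the training fold with replacement before retraining
        boot_train = rng.choice(train_idx, size=train_idx.size, replace=True)
        model = GridSearchCV(
            LogisticRegression(max_iter=1000),
            param_grid={"C": [0.01, 0.1, 1, 10]},
            scoring="roc_auc",
            cv=5,
        )
        model.fit(X[boot_train], y[boot_train])
        scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y, scores)

rng = np.random.default_rng(0)
# Small number of repetitions just for the sketch; n would be larger in practice
retrain_aucs = [pooled_auc_with_bootstrapped_training(X, y, rng) for _ in range(200)]
ci_lower, ci_upper = np.percentile(retrain_aucs, [2.5, 97.5])
```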