Assume that, after running some hyperparameter optimization technique on the training data (possibly using cross-validation to guide the search), the M best models are available to create an ensemble.
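To make the setup concrete, here is a rough sketch of the kind of pipeline I have in mind (Python with scikit-learn; the random-forest base model, the parameter distributions, and M = 3 are only placeholders for whatever is actually used):

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Cross-validated random search over the hyperparameters.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(2, 12),
                         "min_samples_leaf": randint(1, 10)},
    n_iter=20, cv=5, random_state=0,
).fit(X, y)

# Keep the M best configurations found by the search ...
M = 3
order = np.argsort(search.cv_results_["rank_test_score"])[:M]
best_params = [search.cv_results_["params"][i] for i in order]

# ... and combine them into a single (voting) ensemble.
members = [("m%d" % i, RandomForestClassifier(random_state=i, **p))
           for i, p in enumerate(best_params)]
ensemble = VotingClassifier(members).fit(X, y)
```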
How can the performance of the method be assessed?
Some thoughts:
- The simplest way would be to select a single set of M models (resulting from only one optimization run), create N replications (i.e., random train/test splits), and compute the statistics. I believe this approach is biased, as it only accounts for the randomness of the ensemble creation, not for the randomness of the optimization technique.
- A second alternative would be to create N replications and, for each of them, run the entire method (from hyperparameter optimization to ensemble creation). Then we could look at the statistics over the N resulting measures.
- The last alternative I can think of is to create N replications and, for each of them, (1) run the optimization algorithm and (2) create K ensembles. In the end, we could extract statistics from all K*N measures (a rough sketch of this alternative follows the list).
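Reusing the same placeholder pipeline as above, a minimal sketch of this third alternative (which reduces to the second one for K = 1) could look like this; again, the estimator, the parameter distributions, and the specific seeds are just illustrative assumptions:

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def search_best_params(X_tr, y_tr, M, seed):
    """Hyperparameter search on the training split; return the M best configurations."""
    search = RandomizedSearchCV(
        RandomForestClassifier(),
        param_distributions={"max_depth": randint(2, 12),
                             "min_samples_leaf": randint(1, 10)},
        n_iter=20, cv=5, random_state=seed,
    ).fit(X_tr, y_tr)
    order = np.argsort(search.cv_results_["rank_test_score"])[:M]
    return [search.cv_results_["params"][i] for i in order]


X, y = make_classification(n_samples=500, random_state=0)
N, K, M = 10, 5, 3   # replications, ensembles per replication, models per ensemble
scores = []

for n in range(N):
    # One replication = one random train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=n)

    # (1) Re-run the optimization on this replication's training data.
    best_params = search_best_params(X_tr, y_tr, M, seed=n)

    # (2) Build K ensembles from the same M configurations, varying only the
    #     random seed, so the randomness of ensemble creation is also captured.
    for k in range(K):
        members = [("m%d" % i,
                    RandomForestClassifier(random_state=1000 * n + 10 * k + i, **p))
                   for i, p in enumerate(best_params)]
        ensemble = VotingClassifier(members).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, ensemble.predict(X_te)))

scores = np.array(scores)                    # all K*N measures
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```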
Since I wasn't able to find good references on this specific issue, I hope you can help me :)
Please cite published work that supports your answer.
Thank you.