Does anyone know how much of the whole data set should be set aside for model selection? In particular, in the case of imbalanced data, how should we select a portion of the data for model selection?
A common rule of thumb is to reserve about 20% of your data for model selection, leaving 60% for training and 20% for testing.
For imbalanced data you should try to keep the class proportions the same in your training, model-selection, and test sets. One way to do this is stratified sampling: sample at random the corresponding proportion from each class.
For example, say you have 1000 samples and two classes, one of which (class 1) accounts for just 10% of your data. Then you might build your model-selection set as follows:
20 samples at random from class 1 (20% of 100)
180 samples at random from class 2 (20% of 900)
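As a rough sketch of this stratified 60/20/20 split, assuming scikit-learn is available (the toy data, class counts, and random seeds below are illustrative, not part of the answer):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data matching the example: 1000 samples, class 1 is 10% of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [2] * 900)

# Hold out 20% for testing, preserving the class proportions.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Split the remaining 80% into 60% train / 20% model selection
# (0.25 of the remaining 80% equals 20% of the full data set).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(np.bincount(y_val)[1:])  # -> [ 20 180], as in the example above
```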
You can also use more advanced approaches such as k-fold cross-validation or the bootstrap. The key point is to make sure the class proportions are preserved in the training, validation, and test sets; otherwise you could end up training on samples from just one class, which leads to poor generalization.
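For the cross-validation route, scikit-learn's StratifiedKFold preserves the class ratio in every fold. A small sketch on the same kind of toy data (again, the data and seeds are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same toy imbalance: 100 samples of class 1, 900 of class 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [2] * 900)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps the 10%/90% ratio (~20 vs ~180 samples).
    print(f"fold {fold}: validation class counts =", np.bincount(y[val_idx])[1:])
```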
You can divide the data set into 80% for learning and 20% for testing. For the learning subset, you can try bootstrap sampling, where you use the "out of bag" elements for the ensemble selection.
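A minimal sketch of that idea, assuming scikit-learn (the toy data, classifier, and seed are illustrative, not prescribed by the answer): draw one bootstrap sample from the learning subset, fit on it, and score on the out-of-bag elements.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the 80% learning subset.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=800) > 0).astype(int)

# One bootstrap sample: draw n indices with replacement; the indices
# that were never drawn form the "out of bag" set (~37% of the data).
boot = rng.integers(0, len(X), size=len(X))
oob = np.setdiff1d(np.arange(len(X)), boot)

model = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
print("out-of-bag accuracy estimate:", model.score(X[oob], y[oob]))
```

In practice you would repeat this over many bootstrap replicates and average the out-of-bag scores to compare candidate models.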
For an overview of data-mining techniques for imbalanced data sets, see:
Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 853-867). Springer US.
Also take note that Japkowicz (2000) concluded that a "standard multilayer perceptron is not sensitive to the class imbalance problem when applied to linearly separable domains", so the classifier used also has an impact on the sensitivity to class imbalance:
Japkowicz, N. (2000, June). The class imbalance problem: Significance and strategies. In Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI’2000) (Vol. 1, pp. 111-117).