Dear all,

In machine learning, the most common way of splitting a dataset into training and test sets is to randomly allocate the cases to one or the other, according to a predefined percentage p% of cases to be included in the test set. In some cases, particularly with small datasets, this approach will produce unbalanced training and test sets: the training set may not contain the extreme values of the predicted variable and/or certain combinations of the input variables. After training, whichever learning algorithm is used will then most likely have poor predictive ability and, as a consequence, a smaller applicability domain.
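For concreteness, here is a minimal sketch of the plain random split described above, using scikit-learn; the dataset, variable names, and the 25% test fraction are illustrative choices, not taken from any specific study:

```python
# Minimal sketch of a plain random split (illustrative data and names).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))             # small dataset: 50 cases, 3 inputs
y = rng.exponential(scale=2.0, size=50)  # skewed target with a long tail

test_frac = 0.25  # the predefined p% of cases reserved for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_frac, random_state=0
)

# With few cases, the extreme target values can all land in one subset,
# leaving the training set without them:
print("max y overall :", y.max())
print("max y in train:", y_train.max())
```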

Under what circumstances do you decide not to assign the cases randomly, but instead to force some of them to be present in the training set?

Do you, for instance, stratify your dataset into classes/bins reflecting the value of the variable(s) to be predicted, and then allocate p% of the cases of each class to the test set and the remaining (1-p)% to the training set, instead of randomly selecting p% of the cases from the whole dataset? Do you stratify according to the predicted values or according to the values of the input variables?
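A minimal sketch of that stratified alternative, assuming a continuous target that is binned into quantiles before splitting (the bin count and test fraction are illustrative choices):

```python
# Minimal sketch of a stratified split on a binned continuous target.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = rng.exponential(scale=2.0, size=50)

n_bins = 5
bins = pd.qcut(y, q=n_bins, labels=False)  # equal-frequency bins on y

# stratify=bins forces each bin to contribute ~p% of its cases to the
# test set and the remaining (1-p)% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=bins, random_state=0
)

# Each bin is now represented in both subsets, so the training set
# spans the full range of the predicted variable.
```

The same idea works for stratifying on the input variables, e.g. by clustering the inputs and passing the cluster labels to the stratify argument.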

Regards, Luis
