When designing classifiers (using ANNs, SVMs, etc.), models are fitted on a training set. But how should a dataset be divided into training and test sets? With too few training data, our parameter estimates will have greater variance, whereas with too few test data, our performance statistic will have greater variance. What is the right compromise? In practice, depending on the application or the total number of exemplars in the dataset, we usually split the dataset into training (60 to 80%) and testing (20 to 40%) without any principled reason. What is the best way to divide our dataset into training and test sets?
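To make the trade-off concrete, here is a minimal stdlib-only sketch of the two standard options: a single random hold-out split (the 80/20 convention mentioned above) and k-fold cross-validation, which sidesteps the compromise by letting every exemplar serve as test data exactly once. The function names and the 0.2 test fraction are illustrative choices, not established recommendations.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle and split a dataset into train and test portions.
    test_fraction=0.2 is a common convention, not a principled rule."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation,
    so every exemplar appears in exactly one test fold."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    fold_size = n // k
    for f in range(k):
        start = f * fold_size
        stop = start + fold_size if f < k - 1 else n  # last fold absorbs remainder
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

data = list(range(100))
train, test = train_test_split(data)          # 80 train / 20 test exemplars
folds = list(k_fold_indices(len(data), k=5))  # 5 disjoint test folds covering all data
```

With k-fold cross-validation the variance question shifts from "how big should the test set be?" to "how many folds?", since all exemplars contribute to both training and evaluation.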
