I have been studying the size of my training sets. I am wondering if there is an "ideal" size, or rules that can be applied. I am thinking of generative hyper-heuristics that aim at solving NP-hard problems and require a lot of computational resources.
Normally 70% of the available data is allocated for training. The remaining 30% is partitioned equally into validation and test data sets. The partitioning ratio is an important aspect, but apart from this, one must ensure that the population statistics of these data sets differ only marginally from those of the overall data. It should also be ensured that the training data set includes all the patterns used to define the problem and extends to the edges of the modeling domain.
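To make the split concrete, here is a minimal sketch of a 70/15/15 partition using scikit-learn's train_test_split; the arrays X and y are placeholders for whatever data you are actually modeling, not a specific data set from this discussion:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical data: X holds feature vectors, y the targets.
    X = np.random.rand(1000, 10)
    y = np.random.rand(1000)

    # First split: 70% training, 30% held out.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.30, random_state=42)

    # Second split: divide the held-out 30% equally into validation and test (15% each).
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, random_state=42)

    # Quick check that each subset's statistics stay close to the overall data.
    for name, subset in [("train", y_train), ("val", y_val), ("test", y_test)]:
        print(name, subset.mean(), subset.std())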
Suppose we are generating algorithms; in that case I take the patterns to be those of the problem being solved, not the patterns of instructions that make up the algorithms.
Thank you very much for this answer. I really appreciate it.
The data I work with are solutions of the Traveling Salesman Problem and other NP-hard problems. Finding a tour can take up to 10 seconds with short runs. Finding solutions for more demanding problems can double or triple this time. As a result, it can become infeasible to run 100 instances even on a cluster. The computations are highly intensive for generative hyper-heuristics; unlike a neural network, this can take quite a bit of time ...
Do you have any paper or code/application repository that I could have a look at? I'm interested in using this cross-validation to help me achieve better training of a neural network.
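For background, a minimal k-fold cross-validation sketch with scikit-learn is shown below; the MLPRegressor and the random data are only stand-ins for your actual model and data set, not a reference implementation from any particular paper:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    X = np.random.rand(300, 10)   # hypothetical features
    y = np.random.rand(300)       # hypothetical targets

    scores = []
    # 5-fold cross-validation: each fold serves once as the held-out set.
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

    print("mean CV error:", np.mean(scores))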
I don't think there is any 'ideal' ratio for splitting a dataset. It depends on the type of dataset. I would suggest trying different ratios, e.g. 80-20, 70-30, 65-35, etc., and picking the ratio that gives the best performance.
There is no fixed rule for separating training and testing data sets. Most researchers have used a 70:30 ratio. It also depends on the data characteristics, data size, etc. You can use 70:30, 80:20, 65:35, 60:40, etc., whatever suits your data characteristics.
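One simple way to compare such ratios is to loop over them and score each split; in the sketch below the Ridge model and the random data are placeholders chosen only to illustrate the idea, not a recommendation for any particular problem:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    X = np.random.rand(500, 8)   # hypothetical features
    y = np.random.rand(500)      # hypothetical targets

    results = {}
    # Try several test fractions: 80-20, 70-30, 65-35, 60-40.
    for test_frac in (0.20, 0.30, 0.35, 0.40):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac, random_state=0)
        model = Ridge().fit(X_tr, y_tr)
        results[test_frac] = mean_squared_error(y_te, model.predict(X_te))

    # Pick the ratio with the lowest held-out error.
    best = min(results, key=results.get)
    print("best split: train {:.0%} / test {:.0%}".format(1 - best, best))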
I've published two journal articles where I split my data into 7 groups: 5 for training, 1 for validation and 1 for generalization. In short: about 71.4% of the data for training and 28.6% for validation and generalization. It worked very well for me!
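In case it helps, here is one way such a 5/1/1 grouping could be coded with plain NumPy; the array sizes are arbitrary and the shuffle-then-split scheme is only my assumption about how the groups might be formed:

    import numpy as np

    # Hypothetical data set: X holds feature vectors, y the targets.
    rng = np.random.default_rng(0)
    X = rng.random((700, 10))
    y = rng.random(700)

    # Shuffle indices, then cut into 7 equal groups: 5 train, 1 validation, 1 generalization.
    idx = rng.permutation(len(X))
    groups = np.array_split(idx, 7)

    train_idx = np.concatenate(groups[:5])   # ~71.4% of the data
    val_idx = groups[5]                      # ~14.3%
    gen_idx = groups[6]                      # ~14.3%

    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    X_gen, y_gen = X[gen_idx], y[gen_idx]

    print(len(X_train), len(X_val), len(X_gen))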