If you have enough data to spare half for testing without compromising learning, then you can use 50:50, or even 2-CV or 5x2-CV (a repeated cross-validation that used to be recommended following a paper by Dietterich). The CV approaches can be combined with statistical tests (see Dietterich (1998) for a t-test, Alpaydin (1999) for an F-test). Another approach is bootstrapping, with the state of the art being .632+ bootstrapping. Simple .632 bootstrapping is something like a 70:30 selection, except that we select randomly with replacement, so we can get duplicates: we end up with a full-size training set and about 36.8% of the instances left out by chance. It therefore acts something like a CV somewhere between the 63.2:36.8 set ratio and the 100:36.8 bag ratio:
Efron, B. and Tibshirani, R., "Improvements on Cross-Validation: The .632+ Bootstrap Method", Journal of the American Statistical Association, Vol. 92, No. 438 (Jun. 1997), pp. 548-560.
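A minimal sketch of where the 63.2%/36.8% figures come from (plain numpy; this only illustrates the resampling, it is not a full .632+ implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                      # number of instances in the data set
idx = np.arange(n)

# One bootstrap sample: draw n indices with replacement (duplicates allowed),
# so the training "bag" has the full size n.
bag = rng.choice(idx, size=n, replace=True)

# Instances never drawn form the out-of-bag test set; on average this is
# about (1 - 1/n)^n ~ 1/e ~ 36.8% of the data.
oob = np.setdiff1d(idx, bag)

print(f"unique training instances: {len(np.unique(bag)) / n:.1%}")  # ~63.2%
print(f"out-of-bag test instances: {len(oob) / n:.1%}")             # ~36.8%
```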
At the other extreme, appropriate when you have minimal data, is LOO, where we "leave one out" for testing and do this for each of the N instances (PRESS can back an example out to do this cheaply for linear methods). See the many discussions of the bias-variance tradeoff, which have led to general acceptance of 5-CV or 10-CV, or even 20-CV, with some number of repetitions of these (probably not more than 10). The R repetitions try different random K-CV partitionings, giving RxK-CV, but there is a limit to how much more significance can be squeezed from the data. Generally we take statistics over the RK results, and standard errors and apparent significance reduce with √(RK). So the standard error is a factor of 10 smaller for 10x20-CV vs 2-CV, but this is about the limit before we enter fantasyland. One technique is to run significance tests as you go and use this reduction to see when significance is or is not likely to be reached by further repetitions. Here is a paper showing that this approach can halve the amount of training you do vs 10x20-CV (so in fact 10x10-CV is about right on average):
Powers, D.M.W. and Atyabi, A., "The Problem of Cross-Validation: Averaging and Bias, Repetition and Significance", Engineering and Technology (S-CET), 2012 Spring Congress on, pp. 1-5, 27-30 May 2012.
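As a rough sketch of RxK-CV with a running significance check (scikit-learn and scipy assumed; the models, data set and stopping threshold are only illustrative, not the exact procedure of the paper):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a = LogisticRegression(max_iter=5000)
clf_b = DecisionTreeClassifier(random_state=0)

scores_a, scores_b = [], []
for r in range(10):                       # up to R = 10 repetitions of 10-fold CV
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=r)
    scores_a.extend(cross_val_score(clf_a, X, y, cv=cv))
    scores_b.extend(cross_val_score(clf_b, X, y, cv=cv))
    # Paired t-test over the R*K fold scores collected so far (folds are not
    # truly independent, so treat the p-value as a rough guide only).
    t, p = ttest_rel(scores_a, scores_b)
    print(f"after {r + 1} repetitions: p = {p:.4f}")
    if p < 0.01:                          # illustrative early-stopping threshold
        break
```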
The other thing is to watch your learning and validation curves as you try different validation ratios. If the learning curve is levelling out, you can afford to use a more generous validation set; if not, you should use a larger training fold to reduce bias, even though you will see higher variance as difficult, erroneous or noisy examples are popped in and out of the validation fold (whether this makes a big difference in the training fold depends on the stability of your learning algorithm). It can also be useful to keep probabilities, credibilities, ratios or ranks and plot ROC, LIFT or COST curves. For ROC and LIFT, in the absence of specific cost information (viz. assuming getting all negatives right has the same value as getting all positives right), you want to use the operating point whose TPR (Sensitivity or Recall) is as far above FPR (Fallout, 1-InverseRecall or 1-Specificity) as possible (the chance diagonal is TPR=FPR in ROC, and FPR, as Recall of the wrong class, can also be plotted in LIFT charts). Looking at these curves gives you a good idea of the variance as well as the sensitivity to the choice of a particular classifier. Plotting the CV repetitions, as well as the (aggregate) classifier trained on the whole data set and tested on itself (resubstitution), also gives you an idea of the resubstitution bias. It is also important to understand that ROC AUC captures two concepts in one value: how good the learner is and how sensitive the solution is to the precise parameterization:
Powers, D.M.W. "ROC-ConCert: ROC-Based Measurement of Consistency and Certainty", Engineering and Technology (S-CET), 2012 Spring Congress on, 2:238-241
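A small sketch of reading an operating point off the ROC curve (scikit-learn assumed; the data set is a toy example, and the chosen point simply maximises TPR - FPR, i.e. the height above the chance diagonal):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Keep scores/probabilities rather than hard labels, so a curve can be drawn.
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)

# Operating point furthest above the chance diagonal TPR = FPR.
best = np.argmax(tpr - fpr)
print(f"AUC = {roc_auc_score(y_te, probs):.3f}, "
      f"threshold = {thresholds[best]:.3f} "
      f"(TPR = {tpr[best]:.3f}, FPR = {fpr[best]:.3f})")
```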
The best option is to take, randomly, 60% of the data for training and the other 40% for validation. You should repeat this procedure, taking randomly different sets of 60% for training and using the remaining 40% to validate the robustness of your architecture.
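A brief sketch of repeating the random 60/40 split to check robustness (scikit-learn assumed; model and data set are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 10 independent random 60/40 splits; the spread of the scores indicates how
# robust the architecture is to the particular split.
splitter = ShuffleSplit(n_splits=10, train_size=0.6, test_size=0.4, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=splitter)
print(scores.mean(), scores.std())
```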
I think the best option is 60% for training data, 20% for testing data and the other 20% for validation, determined randomly. For more information, you can study my papers.
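For reference, a random 60/20/20 split can be obtained with two consecutive splits (scikit-learn assumed; stratification is optional):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# First peel off 60% for training, then split the remaining 40% in half
# to obtain 20% validation and 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```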
Small data sets (< a few hundred samples): cross-validation or bootstrap (be aware that structure selection also has to be done within the training data set!)
Large data sets: cross-validation, bootstrap, or any partition, e.g. 60/20/20
For a very small dataset, leave-one-out training may be used. Though this estimator's variance is high, researchers publish comparative studies using this performance measure on specific domains (e.g. facial action unit detection).
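A short leave-one-out sketch (scikit-learn assumed, with a toy data set and classifier standing in for the real ones); note the estimate is an average over N single-instance test sets, hence the high variance mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fold per instance: train on N-1 samples, test on the remaining one.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(f"LOO accuracy: {scores.mean():.3f} over {len(scores)} folds")
```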
For a large dataset, don't forget to perform n-fold cross-validation by "rotating" the training subset.
For several applications (e.g. character recognition), you can also increase the size of your training set by applying random or domain-specific transformations (rotation, stretching, thinning, etc.) to original samples in order to synthesize new samples.
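A minimal augmentation sketch for image-like samples (scipy assumed; the transformation parameters and the 28x28 "character image" are arbitrary examples):

```python
import numpy as np
from scipy.ndimage import rotate, zoom

rng = np.random.default_rng(0)

def augment(image):
    """Synthesize a new sample from an original one via a small random
    rotation and mild anisotropic stretching (domain permitting)."""
    angle = rng.uniform(-10, 10)                    # small random rotation
    rotated = rotate(image, angle, reshape=False, mode="nearest")
    sy, sx = rng.uniform(0.9, 1.1, size=2)          # mild stretch per axis
    return zoom(rotated, (sy, sx), mode="nearest")

original = rng.random((28, 28))                     # stand-in for a character image
synthetic = [augment(original) for _ in range(5)]   # 5 extra training samples
```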
The total data can be divided equally, or with a slightly higher share in the training set (50:50 or 60:40). The division can be done randomly or with a data-splitting algorithm. Some popular algorithms are Kennard-Stone's CADEX algorithm, DUPLEX, Kohonen SOM and SPXY; you can find more in research articles.
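A rough numpy sketch of the Kennard-Stone (CADEX) idea: start from the two most distant samples and repeatedly add the sample whose nearest selected neighbour is furthest away (the basic algorithm only, not a tuned implementation):

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return indices of n_train samples chosen by the Kennard-Stone scheme."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    # Seed with the two samples that are furthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [i, j]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from every remaining sample to its nearest selected sample.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(d_min))]      # furthest from the selected set
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected)

X = np.random.default_rng(0).random((100, 5))
train_idx = kennard_stone(X, n_train=60)             # e.g. a 60:40 split
test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
```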
What I faced was a training set matrix of about 1000 x 200. I think the key here is the way the training data is pre-processed. As we all know, more data does not mean more accuracy. The training matrix was reduced using statistical means to 2 x 200 and used to train the ANN. I used the same approach to reduce the testing dataset.
I was very satisfied with the results when they were compared against human classification. If your data can be pre-processed in the same way, I can help.
There's no best answer for this question because the performance depends so much on the dataset. You can also refer to Prechelt (1994); however, the common practice is to try allocations of 80:10:10, 70:15:15 or 60:20:20 (training:validation:testing). For some image processing problems, in my experience 70:15:15 yields the best training and test results.
If we are looking to verify the performance, I think that K-fold cross validation (usually with 5 or 10 folds) is a good way to assess any machine learning approach with a given dataset.
My thought is that a "simple" training-validation-test split yields just one result, and the order of the data (the way you divide the dataset) definitely has an impact on the performance, meaning that changing the order of the data and performing the same split can give different results. Since cross-validation divides the dataset into folds, training and testing with different parts of the dataset, the machine learning approach is assessed multiple times; by computing the average performance, one can assess the overall performance and generalization power, which gives a better understanding of the machine learning approach on that dataset.
From my point of view, the limitations of cross-validation are: (i) time - if you have a long training/test phase, you will be doing it 5 or 10 or K times, which could be costly; (ii) unbalanced dataset/folds - although it can be worked around, an unbalanced dataset can yield unbalanced folds, which can lead to overfitted models in some folds; (iii) small dataset size - closely related to (ii), a small dataset will lead to small folds which, depending on the approach, may not be enough to train the classifier properly, and that will be reflected in the performance.
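Regarding point (ii), a quick way to keep the folds balanced is stratified K-fold (scikit-learn assumed; the imbalanced toy problem is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stratification preserves the class ratio inside every fold, so no fold is
# left with too few minority-class examples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores)
```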
I would like to extend Prof Powers' response by adding some more information about the complications that might arise from using CV and how Efron's approach addresses these issues.
Cross-Validation (CV) is one of the most commonly used approaches for estimating generalization error. It is based on randomly dividing the sample set into some number of fixed-size partitions/folds. Classification performance is estimated by using all but one fold to train the classifier, while the remaining fold, which was not used for training, serves as the unseen test data for estimating generalizability.
The process continues by selecting a different fold as the test data and the remaining folds as the training data, which guarantees that every fold is eventually used as the test data exactly once. To provide a better estimate of the generalization error, this process is repeated several times. Assuming r repetitions of n-fold CV, averaging the prediction error estimates across the n folds and r repetitions gives the required prediction error estimate.
CV suffers from high variability of its results, and its final estimates tend to be biased toward the upper boundary. Efron proposed two alternatives, known as the .632 and .632+ bootstrap, based on averaging estimates over 50*m bootstrap samples of the data, with m being the number of samples in the dataset. These methods address the high variability and upward bias of CV, but their results tend to be biased toward the lower boundary. Jiang and Simon, and Fu, Carroll and Wang, proposed combinations of CV and bootstrap as a better alternative. Although these approaches address the shortcomings of the CV paradigm for estimating error on unseen data, it is worth noting that they are both computationally intensive.
Efron, B. and Tibshirani, R., "Improvements on Cross-Validation: The .632+ Bootstrap Method", Journal of the American Statistical Association, Vol. 92, No. 438, pp. 548-560, 1997.
Jiang, W. and Simon, R., "A comparison of bootstrap methods and an adjusted bootstrap approach for estimating prediction error in micro-array classification", Biometric Research Branch, National Cancer Institute, National Institutes of Health, USA, pp. 1-26, 2007.
Fu, W.J., Carroll, R.J. and Wang, S., "Estimating misclassification error with small samples via bootstrap cross-validation", Bioinformatics, Vol. 21, No. 9, pp. 1979-1986, 2005.
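The weighting behind the simple .632 rule is compact enough to write down (a sketch only; the .632+ variant further adjusts the 0.632 weight using a no-information error rate, which is omitted here):

```python
def err_632(err_resub, err_oob):
    """Simple .632 bootstrap error estimate: a weighted combination of the
    optimistic resubstitution error and the pessimistic average out-of-bag
    error over the bootstrap samples."""
    return 0.368 * err_resub + 0.632 * err_oob

# e.g. 5% resubstitution error, 15% average out-of-bag error
print(err_632(0.05, 0.15))   # -> 0.1132
```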
Personally, I find that a test dataset of around 15-20% of your data yields good results; in other words, use 80-85% of the data for training and 15-20% for testing. Another issue you should consider is the number of training records: try to use feedback to adjust the training data based on experts' opinions, and only when satisfied use the test datasets. My training data were around 1500 records. You may refer to my papers on the topic.