If you have enough data to spare half for testing without compromising learning, then you can use 50:50, or even 2-CV or 5x2-CV (a repeated cross-validation that used to be recommended following a paper by Dietterich). The CV approaches can be combined with statistical tests (see Dietterich (1998) for a t-test, Alpaydin (1999) for an F-test). Another approach is bootstrapping, with the state of the art being .632+ bootstrapping. Simple .632 bootstrapping is something like a 70:30 selection, except that we select randomly with replacement, so we can get duplicates: we end up with a full-size training set and about 36.8% of the instances left out by chance. It therefore acts something like a CV somewhere between the 63.2:36.8 set ratio and the 100:36.8 bag ratio:
Efron, B. and Tibshirani, R., "Improvements on Cross-Validation: The .632+ Bootstrap Method", Journal of the American Statistical Association, Vol. 92, No. 438 (Jun. 1997), pp. 548-560.
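A minimal sketch of where the 63.2%/36.8% figures come from (plain numpy; this only illustrates the resampling, it is not a full .632+ implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                      # number of instances in the data set
idx = np.arange(n)

# One bootstrap sample: draw n indices with replacement (duplicates allowed),
# so the training "bag" has the full size n.
bag = rng.choice(idx, size=n, replace=True)

# Instances never drawn form the out-of-bag test set; on average this is
# about (1 - 1/n)^n ~ 1/e ~ 36.8% of the data.
oob = np.setdiff1d(idx, bag)

print(f"unique training instances: {len(np.unique(bag)) / n:.1%}")  # ~63.2%
print(f"out-of-bag test instances: {len(oob) / n:.1%}")             # ~36.8%
```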
At the other extreme, appropriate when you have minimal data, is LOO, where we "leave one out" for testing and do this for each of the N instances (PRESS can back an example out to do this cheaply for linear methods). See the many discussions of the bias-variance tradeoff, which have led to general acceptance of 5-CV or 10-CV, or even 20-CV, with some number of repetitions of these (probably not more than 10). The R repetitions try different random K-CV partitionings, giving RxK-CV, but there is a limit to how much more significance can be squeezed from the data. Generally we take statistics over the RK results, and standard errors and apparent significance reduce with √(RK). So the standard error is a factor of 10 smaller for 10x20-CV vs 2-CV, but this is about the limit before we enter fantasyland. One technique is to run significance tests as you go and use this reduction to see when significance is or is not likely to be reached by further repetitions. Here is a paper showing that this approach can halve the amount of training you do vs 10x20-CV (so in fact 10x10-CV is about right on average):
Powers, D.M.W. and Atyabi, A., "The Problem of Cross-Validation: Averaging and Bias, Repetition and Significance", Engineering and Technology (S-CET), 2012 Spring Congress on, pp. 1-5, 27-30 May 2012.
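As a rough sketch of RxK-CV with a running significance check (scikit-learn and scipy assumed; the models, data set and stopping threshold are only illustrative, not the exact procedure of the paper):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a = LogisticRegression(max_iter=5000)
clf_b = DecisionTreeClassifier(random_state=0)

scores_a, scores_b = [], []
for r in range(10):                       # up to R = 10 repetitions of 10-fold CV
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=r)
    scores_a.extend(cross_val_score(clf_a, X, y, cv=cv))
    scores_b.extend(cross_val_score(clf_b, X, y, cv=cv))
    # Paired t-test over the R*K fold scores collected so far (folds are not
    # truly independent, so treat the p-value as a rough guide only).
    t, p = ttest_rel(scores_a, scores_b)
    print(f"after {r + 1} repetitions: p = {p:.4f}")
    if p < 0.01:                          # illustrative early-stopping threshold
        break
```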
The other thing is to watch your learning and validation curves as you try different validation ratios. If the learning curve is levelling out, you can afford to use a more generous validation set; if not, you should use a larger training fold to reduce bias, even though you will see higher variance as difficult, erroneous or noisy examples are popped in and out of the validation fold (whether this makes a big difference in the training fold depends on the stability of your learning algorithm). It can also be useful to keep probabilities, credibilities, ratios or ranks and plot ROC, LIFT or COST curves. For ROC and LIFT, in the absence of specific cost information (viz. assuming getting all negatives right has the same value as getting all positives right), you want to use the operating point whose TPR (Sensitivity or Recall) is as far above FPR (Fallout, 1-InverseRecall or 1-Specificity) as possible (the chance diagonal is TPR=FPR in ROC, and FPR, as Recall of the wrong class, can also be plotted in LIFT charts). Looking at these curves gives you a good idea of the variance as well as the sensitivity to the choice of a particular classifier. Plotting the CV repetitions, as well as the (aggregate) classifier trained on the whole data set and tested on itself (resubstitution), also gives you an idea of the resubstitution bias. It is also important to understand that ROC AUC captures two concepts in one value: how good the learner is and how sensitive the solution is to the precise parameterization:
Powers, D.M.W. "ROC-ConCert: ROC-Based Measurement of Consistency and Certainty", Engineering and Technology (S-CET), 2012 Spring Congress on, 2:238-241
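A small sketch of reading an operating point off the ROC curve (scikit-learn assumed; the data set is a toy example, and the chosen point simply maximises TPR - FPR, i.e. the height above the chance diagonal):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Keep scores/probabilities rather than hard labels, so a curve can be drawn.
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)

# Operating point furthest above the chance diagonal TPR = FPR.
best = np.argmax(tpr - fpr)
print(f"AUC = {roc_auc_score(y_te, probs):.3f}, "
      f"threshold = {thresholds[best]:.3f} "
      f"(TPR = {tpr[best]:.3f}, FPR = {fpr[best]:.3f})")
```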
The best option is to take, randomly, 60% of the data for training and the other 40% for validation. You should repeat this procedure, taking randomly different sets of 60% for training and using the remaining 40% to validate the robustness of your architecture.
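A brief sketch of repeating the random 60/40 split to check robustness (scikit-learn assumed; model and data set are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 10 independent random 60/40 splits; the spread of the scores indicates how
# robust the architecture is to the particular split.
splitter = ShuffleSplit(n_splits=10, train_size=0.6, test_size=0.4, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=splitter)
print(scores.mean(), scores.std())
```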
I think the best option is 60% for training data, 20% for testing data and the other 20% for validation, determined randomly. For more information, you can study my papers.
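For reference, a random 60/20/20 split can be obtained with two consecutive splits (scikit-learn assumed; stratification is optional):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# First peel off 60% for training, then split the remaining 40% in half
# to obtain 20% validation and 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```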
Small data sets (< a few hundred samples): cross-validation or bootstrap (be aware that structure selection also has to be done within the training data set!)
Large data sets: cross-validation, bootstrap, or any partition, e.g. 60/20/20
For a very small dataset, leave-one-out training may be used. Though this estimator's variance is high, researchers publish comparative studies using this performance measure on specific domains (e.g. facial action unit detection).
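A short leave-one-out sketch (scikit-learn assumed, with a toy data set and classifier standing in for the real ones); note the estimate is an average over N single-instance test sets, hence the high variance mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fold per instance: train on N-1 samples, test on the remaining one.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(f"LOO accuracy: {scores.mean():.3f} over {len(scores)} folds")
```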
For a large dataset, don't forget to perform n-fold cross-validation by "rotating" the training subset.
For several applications (e.g. character recognition), you can also increase the size of your training set by applying random or domain-specific transformations (rotation, stretching, thinning, etc.) to original samples in order to synthesize new samples.
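A minimal augmentation sketch for image-like samples (scipy assumed; the transformation parameters and the 28x28 "character image" are arbitrary examples):

```python
import numpy as np
from scipy.ndimage import rotate, zoom

rng = np.random.default_rng(0)

def augment(image):
    """Synthesize a new sample from an original one via a small random
    rotation and mild anisotropic stretching (domain permitting)."""
    angle = rng.uniform(-10, 10)                    # small random rotation
    rotated = rotate(image, angle, reshape=False, mode="nearest")
    sy, sx = rng.uniform(0.9, 1.1, size=2)          # mild stretch per axis
    return zoom(rotated, (sy, sx), mode="nearest")

original = rng.random((28, 28))                     # stand-in for a character image
synthetic = [augment(original) for _ in range(5)]   # 5 extra training samples
```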
The total data can be divided equally, or with a slightly higher share in the training set (50:50 or 60:40). The division can be done randomly or with a data-splitting algorithm. Some popular algorithms are Kennard-Stone's CADEX algorithm, DUPLEX, Kohonen SOM and SPXY; you can find more in research articles.
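A rough numpy sketch of the Kennard-Stone (CADEX) idea: start from the two most distant samples and repeatedly add the sample whose nearest selected neighbour is furthest away (the basic algorithm only, not a tuned implementation):

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return indices of n_train samples chosen by the Kennard-Stone scheme."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    # Seed with the two samples that are furthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [i, j]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from every remaining sample to its nearest selected sample.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(d_min))]      # furthest from the selected set
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected)

X = np.random.default_rng(0).random((100, 5))
train_idx = kennard_stone(X, n_train=60)             # e.g. a 60:40 split
test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
```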
What I faced was a training set matrix of about 1000 x 200. I think the key here is the way the training data is pre-processed. As we all know, more data does not mean more accuracy. The training matrix was reduced using statistical means to 2 x 200 and used to train the ANN. I used the same approach to reduce the testing dataset.
I was very satisfied with the results when they were compared against human classification. If your data can be pre-processed in the same way, I can help.
There's no best answer for this question because the performance depends so much on the dataset. You can also refer to Prechelt (1994); however, the common practice is to try allocations of 80:10:10, 70:15:15 or 60:20:20 (training:validation:testing). For some image processing problems, in my experience 70:15:15 yields the best training and test results.
If we are looking to verify the performance, I think that K-fold cross validation (usually with 5 or 10 folds) is a good way to assess any machine learning approach with a given dataset.
My thought is that a "simple" training-validation-test split yields just one result, and the order of the data (the way you divide the dataset) definitely has an impact on the performance, meaning that changing the order of the data and performing the same split can give different results. Since cross-validation divides the dataset into folds, training and testing with different parts of the dataset, the machine learning approach is assessed multiple times; by computing the average performance, one can assess the overall performance and generalization power, which gives a better understanding of the machine learning approach on that dataset.
From my point of view, the limitations of cross-validation are: (i) time - if you have a long training/test phase, you will be doing it 5 or 10 or K times, which could be costly; (ii) unbalanced dataset/folds - although it can be worked around, an unbalanced dataset can yield unbalanced folds, which can lead to overfitted models in some folds; (iii) small dataset size - closely related to (ii), a small dataset will lead to small folds which, depending on the approach, may not be enough to train the classifier properly, and that will be reflected in the performance.
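Regarding point (ii), a quick way to keep the folds balanced is stratified K-fold (scikit-learn assumed; the imbalanced toy problem is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stratification preserves the class ratio inside every fold, so no fold is
# left with too few minority-class examples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores)
```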
I would like to extend Prof Powers' response by adding some more information about the complications that might arise from using CV and how Efron's approach addresses these issues.
Cross-Validation (CV) is one of the most commonly used approaches for estimating generalization error. It is based on randomly dividing the sample set into some number of fixed-size partitions/folds. Classification performance is estimated by using all but one fold to train the classifier, while the remaining fold, which was not used for training, serves as the unseen test data for estimating generalizability.
The process continues by selecting a different fold as the test data and the remaining folds as the training data, which guarantees that every fold is eventually used as the test data exactly once. To provide a better estimate of the generalization error, this process is repeated several times. Assuming r repetitions of n-fold CV, averaging the prediction error estimates across the n folds and r repetitions gives the required prediction error estimate.
CV suffers from high variability of its results, and its final estimates tend to be biased toward the upper boundary. Efron proposed two alternatives, known as the .632 and .632+ bootstrap, based on averaging estimates over 50*m bootstrap samples of the data, with m being the number of samples in the dataset. These methods address the high variability and upward bias of CV, but their results tend to be biased toward the lower boundary. Jiang and Simon, and Fu, Carroll and Wang, proposed combinations of CV and bootstrap as a better alternative. Although these approaches address the shortcomings of the CV paradigm for estimating error on unseen data, it is worth noting that they are both computationally intensive.
Efron, B. and Tibshirani, R., "Improvements on Cross-Validation: The .632+ Bootstrap Method", Journal of the American Statistical Association, Vol. 92, No. 438, pp. 548-560, 1997.
Jiang, W. and Simon, R., "A comparison of bootstrap methods and an adjusted bootstrap approach for estimating prediction error in micro-array classification", Biometric Research Branch, National Cancer Institute, National Institutes of Health, USA, pp. 1-26, 2007.
Fu, W.J., Carroll, R.J. and Wang, S., "Estimating misclassification error with small samples via bootstrap cross-validation", Bioinformatics, Vol. 21, No. 9, pp. 1979-1986, 2005.
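The weighting behind the simple .632 rule is compact enough to write down (a sketch only; the .632+ variant further adjusts the 0.632 weight using a no-information error rate, which is omitted here):

```python
def err_632(err_resub, err_oob):
    """Simple .632 bootstrap error estimate: a weighted combination of the
    optimistic resubstitution error and the pessimistic average out-of-bag
    error over the bootstrap samples."""
    return 0.368 * err_resub + 0.632 * err_oob

# e.g. 5% resubstitution error, 15% average out-of-bag error
print(err_632(0.05, 0.15))   # -> 0.1132
```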
Personally, I find that a test dataset of around 15-20% of your data yields good results; in other words, use 80-85% of the data for training and 15-20% for testing. Another issue you should consider is the number of training records: try to use feedback to adjust the training data based on experts' opinions, and only when satisfied use the test datasets. My training data were around 1500 records. You may refer to my papers on the topic.