Is there a relation between the amount of training data required and the machine learning model being trained? How does this requirement differ from model to model? Any guidance or links to articles would be highly appreciated.
The required training data size depends on the complexity of your model: for example, the number of inputs/outputs, the relationships between parameters, the noise in the data, and the variance and standard deviation of every parameter.
Every problem is different, so the best approach is to make sure your data covers the full range you care about for every parameter. Train the model and then test it; comparing training performance against testing performance will give you clues about whether you need more data. In my experience, the model will tend to overfit easily if there is not enough data.
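A minimal sketch of this train-vs-test comparison, using scikit-learn and a synthetic dataset (both my own choices, not from the answer above):

```python
# Minimal sketch (my own illustration): compare training vs. test accuracy
# to judge whether the model is overfitting and might need more data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
# A large gap (high train accuracy, much lower test accuracy) suggests overfitting,
# which often means the model is too complex for the amount of data available.
```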
It is like the number of equations (training data) vs. the number of unknowns (model complexity). Of course, because the system is nonlinear, it is not quite as simple as that. The answer above looks good.
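To make the counting analogy concrete, here is a tiny sketch (my own illustration, using NumPy) of an underdetermined system, i.e. more unknowns than equations:

```python
# Sketch of the analogy: with fewer "equations" (data points) than "unknowns"
# (parameters), the system is underdetermined and many solutions fit perfectly.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))   # 3 equations (data points), 5 unknowns (parameters)
b = rng.normal(size=3)

x, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print("rank:", rank)                          # rank 3 < 5 unknowns -> infinitely many exact solutions
print("residual:", np.abs(A @ x - b).max())   # essentially zero: a "perfect" fit that tells us little
```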
The number of training examples is not so much a question of the model; it is more a question of the structure of the data and the data space. You can have a high-dimensional data space and still learn a good model if the data are cleanly separated. But you can also have a data space where you need an enormous amount of data, or have to switch to a higher-dimensional representation, if the data are nearly inseparable.
In computational learning theory there is the concept of the VC dimension (Vapnik–Chervonenkis dimension, https://en.wikipedia.org/wiki/VC_dimension), which gives a lower bound on the minimal number of training examples a learning algorithm needs to learn a concept approximately correctly. However, the VC dimension of an algorithm is more of theoretical interest than of practical use. For some of the standard ML algorithms the VC dimension can be determined, but I doubt that it is known for all algorithms currently around.
In his video on the VC dimension (https://www.youtube.com/watch?v=Dc0sr0kdBVI&hd=1), Yaser Abu-Mostafa states as a rule of thumb that, in a large number of situations, it is safe to use about 10 times the VC dimension as the number of training examples (a rough illustration follows below).
Unfortunately, there is no direct relationship between the number of features in a learning problem and the VC dimension of the algorithm used. However, the "curse of dimensionality" tells us that the volume of the data space grows exponentially with the number of features and their data types. Thus, I doubt that this rule of thumb can be translated directly into a rule about the number of features of a learning problem.
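To illustrate the rule of thumb mentioned above with a case where the VC dimension is known exactly: a linear classifier (perceptron) on d-dimensional inputs has VC dimension d + 1. The small sketch below is my own addition and simply applies the 10x factor:

```python
# Rough sketch: rule-of-thumb sample size of ~10x the VC dimension,
# using the known VC dimension of a linear classifier (d + 1) as an example.

def linear_classifier_vc_dim(n_features: int) -> int:
    """VC dimension of a linear classifier (perceptron) in n_features dimensions."""
    return n_features + 1

def rule_of_thumb_samples(vc_dimension: int, factor: int = 10) -> int:
    """Suggested number of training examples (~factor x VC dimension)."""
    return factor * vc_dimension

if __name__ == "__main__":
    for d in (2, 10, 100):
        n = rule_of_thumb_samples(linear_classifier_vc_dim(d))
        print(f"{d:>3} features -> VC dim {d + 1:>3} -> ~{n} training examples")
```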
The short answer is: an infinite sample size. But see our paper "Bootstrapping the Out-of-sample Predictions for Efficient an..." to read how to best estimate what you can do with whatever sample you have.
By the way, you are probably asking the wrong question. Either you are given a specific sample size and there is no choice, or you have the option of producing more samples, possibly at a cost, in which case you should use an active learning method.
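For readers unfamiliar with active learning, here is a rough sketch of one common variant, pool-based uncertainty sampling; the scikit-learn model, synthetic data, and loop sizes are all my own assumptions, not from the paper mentioned above:

```python
# Sketch of pool-based active learning with uncertainty sampling (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=20, replace=False))   # small initial labeled set
pool = [i for i in range(len(X)) if i not in labeled]        # unlabeled pool

model = LogisticRegression(max_iter=1000)
for _ in range(10):                                          # 10 rounds of querying
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)                      # least-confident sampling
    query = pool[int(np.argmax(uncertainty))]                # most uncertain point
    labeled.append(query)                                    # "ask for" its label
    pool.remove(query)

print(f"labeled examples used: {len(labeled)}")
```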
In my experience, this is quite domain specific, and it partly depends on what kind of model you are using to approximate the hypothesis.
That said, even a training dataset of decent size will cause you trouble if it suffers from class imbalance, the curse of dimensionality, and so on.
Then there are generative adversarial networks (GANs), which attempt to learn a distribution similar to the original data and generate new samples for you. There are also oversampling techniques like Safe-Level SMOTE, which can be of use if your training dataset is not good enough to train on directly (see the sketch after this answer).
I would like to reiterate, though, that all these statements are very domain specific in my experience.
To conclude, as long as the training set gives your model enough material to generalize well, you should be good to go.
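As an illustration of the oversampling idea mentioned in this answer, here is a minimal sketch using plain SMOTE from the imbalanced-learn package; as far as I know, Safe-Level SMOTE itself is not shipped with that library, so this is only a stand-in for the general approach:

```python
# Illustrative only: oversampling an imbalanced dataset with standard SMOTE
# (a stand-in for the Safe-Level SMOTE variant mentioned above).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: ~90% majority class, ~10% minority class.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_resampled))
```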
Thanks to all of you for the answers. My question was not dataset specific or problem specific. Rather, given today's high availability of data, I was wondering whether, after some point, additional data gives any more discriminating power. I will definitely read up on the VC dimension. Thank you once again!
After some point, additional samples become redundant. Where is this point? Answering that would require statistical power analysis for multivariable (high-dimensional) analyses, which is impossible in general. Thus, there is no a priori answer except in very specific cases, e.g., when the sample size is much larger than the number of predictors and all predictors are discrete with a small number of values.
But during or after the analysis you can compute confidence intervals for the performance estimate at various sample sizes and create the learning curve of the classifier on the particular data distribution. If the confidence interval is very narrow, extra samples won't help. If the learning curve has plateaued, extra samples won't help. Again, the paper I mentioned in my previous answer contains techniques for computing the confidence intervals as well as a reference in the Bioinformatics journal for computing the learning curve.
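Here is a quick sketch of the learning-curve idea using scikit-learn's cross-validated learning_curve; note that this is not the bootstrap method from the paper cited above, just a simple illustration of checking for a plateau:

```python
# Sketch: empirical learning curve with cross-validated score estimates,
# to check whether performance has plateaued as the training set grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, scores in zip(train_sizes, test_scores):
    mean, std = scores.mean(), scores.std()
    # +/- 2 std across CV folds gives a rough spread, not a formal confidence interval.
    print(f"n={n:>5}: CV accuracy {mean:.3f} +/- {2 * std:.3f}")
```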
It totally depends on the model being trained. For most common machine learning algorithms, including neural networks, decision trees, and tree ensembles, performance is expected to reach a maximum threshold at some point no matter how much more data is included in the training. In other words, there is a point where more data may not improve the model. How much data is that? It depends entirely on the parameters and complexity of each trained model as well as the variance in the data. However, when training deep neural networks, more data usually gives you a better chance of higher accuracy, especially if the training is performed appropriately. That is why many people prefer training deep learning applications on supercomputers.
In "Neural Network Design" (by Hagan, Demuth, Beale, de Jesus; guys involved in developing Matlab ANN Toolbox) can be found that 70 % of data as the training set is typical (15 % for validation, 15 % for testing purposes). Polish scientists prof. Tadeusiewicz and Osowski (well known for Polish ANN users) agree that 70:15:15 or 60:20:20 are good ratios.
I appreciate Saleh Mousa's answer (the big-data approach). What I've written above concerns cases where we suffer from a lack of data, i.e. we are limited to a few hundred cases (definitely not big data). Then the accuracy of the results becomes a problem. The number of training examples should be much higher than the number of weights (connections between neurons).
A training dataset that is too small does not allow you to reach the optimum number of neurons (optimum = the number providing the most accurate results).
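A simple sketch of the 70:15:15 split mentioned above, done with two successive scikit-learn train_test_split calls (my own illustration; the authors' context is the MATLAB toolbox, whereas this uses Python):

```python
# Sketch: 70/15/15 train/validation/test split via two successive
# train_test_split calls (sklearn has no single three-way split function).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First split off the 70% training portion.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Then split the remaining 30% evenly into validation and test sets (15% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```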