Overfitting is a well-known phenomenon in data mining. Many methods for dealing with it are reported in the literature, but not many working examples. I need some good references on the topic.
The simplest way to avoid over-fitting is to make sure that the number of independent parameters in your fit is much smaller than the number of data points you have. By independent parameters, I mean the number of coefficients in a polynomial or the number of weights and biases in a neural network, not the number of independent variables. My rule of thumb is to select a form for the fit such that the number of data points is 5x to 10x the number of coefficients. If you cannot afford that luxury, you can go lower, but never below 2x. Simple example: if you have ten data points in a single variable, y = f(x), a 9th-order polynomial will give you a perfect fit -- a classic example of over-fitting. Using my rule of thumb, you would instead try to fit a quadratic or a fourth-order curve. The basic idea is that if the number of data points is ten times the number of parameters, overfitting is very unlikely.
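For concreteness, here is a minimal NumPy sketch of that example; the sine-plus-noise toy data and the degrees compared are illustrative assumptions, not part of the rule itself:

```python
# Ten noisy points: a degree-9 polynomial (10 coefficients) interpolates the
# noise exactly, while a quadratic or quartic (3-5 coefficients) generalizes better.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)   # noisy training data

x_dense = np.linspace(0, 1, 200)            # points not used for fitting
y_true = np.sin(2 * np.pi * x_dense)

for degree in (2, 4, 9):
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    y_fit = np.polyval(coeffs, x_dense)
    rmse = np.sqrt(np.mean((y_fit - y_true) ** 2))
    print(f"degree {degree}: RMSE off the training points = {rmse:.3f}")

# The degree-9 fit passes through every training point, but its error away from
# those points is typically far larger than that of the lower-order fits.
```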
The "classic" way to avoid overfitting is to divide your data sets into three groups -- a training set, a test set, and a validation set. You find the coefficients using the training set; you find the best form of the equation using the test set, test for over-fitting using the validation set. Be careful not to use the validation set until after you have picked the best form of fit. See: http://en.wikipedia.org/wiki/Test_set
I've come across this idea: split your data into three mutually exclusive sets: training, test, and evaluation. Build your models on the training set and fine-tune them using the test set. When you're happy with your model, evaluate it ONLY ONCE on the last data set. Hope it brings something new to you!
The best approach for this is k-fold cross-validation. The leave-one-out (LOO) approach would be the first alternative, but it needs a lot of computation... Try to stick with the k-fold approach by defining a minimum acceptable test-set size. See some of my papers (e.g., the Hydrol. Process. or COMPAG papers). Keep me posted if this is not clear.
You should have institutional access to the publishers... If you do not have access, please request the papers through ResearchGate so that I can send them immediately.
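A minimal k-fold cross-validation sketch with scikit-learn; k = 5, the ridge model, and the synthetic data are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print("R2 per fold:", scores.round(3), "mean:", scores.mean().round(3))

# Leave-one-out is the k = n_samples limit (sklearn.model_selection.LeaveOneOut)
# and is far more expensive, as noted above.
```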
Split your training set into two parts (a training set and a validation set). Then apply the concept of early stopping, i.e., spot the point of minimum model error on the validation set on the MSE-vs.-epoch curve and stop training there. This helps overcome the overfitting problem.
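A sketch of that idea, assuming scikit-learn's MLPRegressor trained one epoch at a time with partial_fit; the network size, learning rate, and patience of 10 epochs are illustrative assumptions:

```python
import copy
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

net = MLPRegressor(hidden_layer_sizes=(50,), learning_rate_init=0.01, random_state=0)
best_mse, best_net, patience, since_best = np.inf, None, 10, 0

for epoch in range(500):
    net.partial_fit(X_tr, y_tr)                        # one training pass (epoch)
    mse = mean_squared_error(y_val, net.predict(X_val))
    if mse < best_mse:                                 # new minimum on the MSE-vs-epoch curve
        best_mse, best_net, since_best = mse, copy.deepcopy(net), 0
    else:
        since_best += 1
        if since_best >= patience:                     # validation error stopped improving
            break

print(f"stopped at epoch {epoch}, best validation MSE = {best_mse:.2f}")
# best_net holds the weights from the epoch with the lowest validation error.
```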
We analyzed this problem for neural networks in Tetko, I.V.; Livingstone, D.J.; Luik, A.I. Neural Network Studies. 1. Comparison of Overfitting and Overtraining. J. Chem. Inf. Comput. Sci. 1995, 35(5), 826-833. Since then we have split data into three sets: a training set, a validation set (on which the performance of the model is monitored during training), and a test set. The last one is used to measure the actual prediction accuracy of the models (e.g., via 5-fold cross-validation). The training and validation sets are of the same size.
Also be careful with the complexity of the system (for example, the number of nodes in an ANN). Too many parameters lead to overfitting (more parameters to adjust than data in the training set). Try to find the minimal ANN architecture that solves the problem.
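One way to act on that advice, sketched with scikit-learn; the candidate hidden-layer sizes and the 1% tolerance are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Score a range of hidden-layer sizes on the validation set.
scores = {}
for n_hidden in (2, 5, 10, 25, 50, 100):
    clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=1000, random_state=0)
    scores[n_hidden] = clf.fit(X_tr, y_tr).score(X_val, y_val)

# Keep the smallest network whose score is within 1% of the best one.
best = max(scores.values())
smallest_good = min(n for n, s in scores.items() if s >= best - 0.01)
print(scores, "-> smallest adequate hidden size:", smallest_good)
```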
I am grateful to all of you for your answers. I am working on a small monograph to identify overfitting and underfitting in data mining in general, not specific to a model such as an ANN. I am more interested in working on the data rather than the methodology.
For example, if you fit a logistic regression with 20 predictors and one target variable, what does one need to see in the data to avoid overfitting and underfitting while using the standard method of dividing the data into three parts (training, test, and validation) and also employing cross-validation?
I am reading the material you provided and will get back to you with a write-up for your feedback.
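For concreteness, a minimal scikit-learn sketch of the workflow described in the question above (a 20-predictor logistic regression, a hold-out split, and cross-validation to pick the regularization strength); the synthetic data and the grid of C values are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)

# Keep 20% aside as a final hold-out set.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation over a grid of inverse regularization strengths C:
# smaller C (stronger penalty) guards against overfitting; too small C underfits.
model = LogisticRegressionCV(Cs=[0.01, 0.1, 1, 10, 100], cv=5, max_iter=5000)
model.fit(X_dev, y_dev)

print("chosen C:", model.C_[0])
print("training accuracy:", round(model.score(X_dev, y_dev), 3))
print("hold-out accuracy:", round(model.score(X_holdout, y_holdout), 3))
```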
Instead of carrying out only training and testing, do training and validation before the testing. Over-fitting will be moderated during the validation stage. It does mean that you need to split the total dataset into three non-overlapping datasets. Good luck.
There are methods for estimating the level of noise in the data, known as the gamma and delta tests, which in principle do not require data splitting. They are similar to estimating the nugget of the variogram in geostatistics. You can stop the learning of the ANN just before the training error goes below the noise level (estimated a priori using the gamma or delta test). Using a similar principle, some authors (Grandvalet and others, I believe) propose injecting dynamic noise into the data during the training process of the ANN.
If you split the data into training, validation, and test sets or employ cross-validation techniques, be careful with data that are serially correlated (e.g., in time), strongly clustered, or pseudo-replicated. It is too easy for the model to predict a test sample that is very close to a training sample (in feature space). This can lead to an underestimation of the true prediction errors. I had issues of this type when analysing geophysical data with ANN and SVM classification/regression methods.
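One way to guard against that, sketched with scikit-learn; the lagged toy series is an assumption for illustration, and GroupKFold plays the same role for clustered or pseudo-replicated data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
t = np.arange(500)
y = np.sin(t / 20.0) + rng.normal(scale=0.2, size=t.size)   # autocorrelated toy series

lags = 5
X = np.column_stack([y[i:len(y) - lags + i] for i in range(lags)])  # lagged features
y_target = y[lags:]

# Training folds always precede the test fold in time, so the model cannot
# "predict" samples that sit next to its own training samples.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(), X, y_target, cv=cv, scoring="r2")
print("R2 per temporal fold:", scores.round(3))
```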
George Corliss summarized the recommendations I was going to make better than I could. In particular, if the size of your data set is much larger than the number of coefficients in your regression (my rule of thumb is >10x), overfitting is unlikely to be a problem.
Early stopping, as Ka-Chun suggests, is one form of regularization that was popular in the early history of neural networks, but it does not work in all cases, nor for all kinds of models or regularizers. L1 regularization, in particular, is not easily implemented this way.
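A small scikit-learn sketch contrasting L1 and L2 penalties, which constrain model complexity directly rather than through early stopping; the synthetic data and alpha value are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (Lasso(alpha=1.0), Ridge(alpha=1.0)):
    model.fit(X_tr, y_tr)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{type(model).__name__}: test R2 = {model.score(X_te, y_te):.3f}, "
          f"coefficients driven to zero = {n_zero}/{model.coef_.size}")

# L1 (Lasso) sets many coefficients exactly to zero, which early stopping cannot
# reproduce; L2 (Ridge) only shrinks them.
```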
I would just like to give some additional ideas for preventing overfitting:
1 - If you are using an iterative method (e.g., EM, decision trees), it is an excellent idea to plot the performance gain after each iteration. Overfitting will manifest itself as a steadily shrinking performance gain; if the gain curve flattens into a horizontal line well before the iterations end, that is a clear signal of overfitting (see the sketch after this list).
2 - If you are working with methods that use principal curves or planes (e.g., SVM) and your dataset is small, it could be a good idea to use your training set as a validation set. Oddly, for me, using the training set as the validation set has usually worked as an indicator for spotting overfitting.
3 - Increasing the dataset size usually delays overfitting, so it may work for some problems. However, using oversampling to increase the dataset size won't work, so beware.
4 - Try validating the tf-idf scores of your dataset, and if there are not enough discriminative features, you might want to try to find new features with more discriminative power, or do the inverse and apply a stricter feature reduction. That is just an idea; I haven't tried it in any of my own work to prevent overfitting, but I think it will work to an extent, since discriminative features generate larger distances between data instances, so overfitting should have less effect on performance.
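A sketch of point 1 above, tracking the validation score after each iteration; gradient boosting is used here only as a convenient iterative method, and the dataset and number of iterations are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gb = GradientBoostingClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Validation accuracy after each boosting iteration.
val_scores = [accuracy_score(y_val, y_pred) for y_pred in gb.staged_predict(X_val)]

best_iter = max(range(len(val_scores)), key=val_scores.__getitem__)
print(f"best validation accuracy {val_scores[best_iter]:.3f} at iteration {best_iter}")
for it in (10, 50, 100, 299):
    print(f"iteration {it}: validation accuracy = {val_scores[it]:.3f}")

# If the per-iteration gain flattens to ~0 (or turns negative) long before the
# final iteration, the later iterations are only fitting noise.
```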
I agree that splitting your dataset into multiple subsets as suggested by others is a good way to evaluate the robustness of the predictive model. Many factors contribute to overfitting (e.g. too many descriptors, learning parameters, applicability domain, etc.).
Comparing the performance on the training set and the validation set is also a good way to diagnose potential overfitting, particularly if the difference between the two is large; you want a predictive model that affords a similar level of performance on both the training and validation sets.
Also, rigorously test your predictive model against several external sets (e.g., positive controls, negative controls, decoys, and data collected prospectively from a different time frame than the training set).
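A minimal sketch of the train-versus-validation comparison above; the random-forest model, synthetic data, and the 0.1 threshold on the R2 gap are arbitrary illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=30, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
r2_train, r2_val = model.score(X_tr, y_tr), model.score(X_val, y_val)

print(f"train R2 = {r2_train:.3f}, validation R2 = {r2_val:.3f}")
if r2_train - r2_val > 0.1:
    print("Large train/validation gap -- possible overfitting.")
```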
Aside from the above comments, I also suggest some references:
Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation
http://www.jcheminf.com/content/6/1/47
Oversampling to Overcome Overfitting: Exploring the Relationship between Data Set Composition, Molecular Descriptors, and Predictive Modeling Methods
http://pubs.acs.org/doi/abs/10.1021/ci4000536
Critical Assessment of QSAR Models of Environmental Toxicity against Tetrahymena pyriformis: Focusing on Applicability Domain and Overfitting by Variable Selection
Perhaps not the best of ideas, but introducing some noisy data into your training set may reduce the chances of over-fitting. The difficulty I see in this technique is figuring out how much noise is enough to avoid over-fitting. See: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2771718/
Actually, Tshilidzi's idea of introducing noise is an excellent one. If you extend it to the concept of not only adding noise but also adding small transformations of the inputs, you have one of the more important methods for getting really excellent models out of simple learning algorithms.
Check out this paper for an example: http://arxiv.org/pdf/1003.0358.pdf
This method of training on deformations of the input was able to set a record (at the time) for isolated hand-written digit recognition.
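A hedged sketch of training on noisy, slightly deformed copies of the inputs, in the spirit of the two answers above; the helper function, noise scale, and number of copies are assumptions for illustration, not the method used in the cited paper:

```python
import numpy as np

def augment(X, y, n_copies=5, noise_scale=0.05, shift_scale=0.02, seed=0):
    """Return the original data plus noisy, slightly shifted copies."""
    rng = np.random.default_rng(seed)
    X_aug, y_aug = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(scale=noise_scale, size=X.shape)          # per-feature jitter
        shift = rng.normal(scale=shift_scale, size=(1, X.shape[1]))  # small global shift
        X_aug.append(X + noise + shift)
        y_aug.append(y)                                              # labels unchanged
    return np.vstack(X_aug), np.concatenate(y_aug)

# Usage sketch on a toy feature matrix.
X = np.random.default_rng(1).normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)   # (600, 10) (600,)
```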
To avoid overfitting, change the learning set for each analysis. Overfitting is largely caused by repeatedly feeding results back into the same dataset.
This is a well-known fact. The reason analysts overlook it, I guess, is that it helps them arrive at conclusions in the hope that those conclusions will pass the test.
I studied this problem for neural networks in a recent paper:
Strategies to develop robust neural network models: Predicti...
As a rule of thumb, what @Shreeder Adibhatla suggested seems to work well. However, I suggest a more robust strategy for this problem, which can be used for other fitting problems too. Briefly stated, you divide your dataset into training, validation, and test sets. Then, for each set of initially assigned parameters, after optimization, keep the same initial parameters and repeat the training on other random divisions of the data several times. The average of the results across repeats can be considered the true performance of that model and initial parameter set; otherwise the result can be affected by a lucky or unlucky dataset division. Then a two-sample t-test is applied to compare the errors of the training and test sets. Models for which, in the majority of repeats, the errors of these two sets are not significantly different at the required confidence level can be considered efficiently trained models (low risk of overfitting). The Matlab code I used for this study, with detailed comments, can be obtained by contacting me ([email protected]).
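A hedged Python sketch of that repeated-split-plus-t-test idea (not the author's Matlab code); the model, synthetic data, 20 repeats, and 0.05 significance level are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=400, n_features=15, noise=10.0, random_state=0)

train_errs, test_errs = [], []
for seed in range(20):                                   # repeated random divisions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000,
                       random_state=0).fit(X_tr, y_tr)   # same initial parameters each repeat
    train_errs.append(mean_squared_error(y_tr, net.predict(X_tr)))
    test_errs.append(mean_squared_error(y_te, net.predict(X_te)))

print("mean train MSE:", round(np.mean(train_errs), 2),
      "mean test MSE:", round(np.mean(test_errs), 2))

# Two-sample t-test comparing training and test errors across the repeats.
t_stat, p_value = ttest_ind(train_errs, test_errs)
print("t-test p-value:", round(p_value, 4),
      "-> no significant gap (low overfitting risk)" if p_value > 0.05
      else "-> training error significantly lower (possible overfitting)")
```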
Overfitting can be reduced by using hyperparameters that lead to simpler models: for instance, L1 and L2 penalties for regression problems, pruning for tree-based models, and dropout layers in the case of neural networks.
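A brief sketch of one of those hyperparameters: cost-complexity pruning of a decision tree via ccp_alpha in scikit-learn; the dataset and the alpha grid are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for alpha in (0.0, 0.005, 0.02):     # larger alpha => more aggressive pruning
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"ccp_alpha={alpha}: leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_tr, y_tr):.3f}, val acc={tree.score(X_val, y_val):.3f}")
```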
There are different techniques you can use to avoid overfitting, such as:
(1) Add dropout layers
(2) Use Data Augmentation
(3) Use Regularization
(4) Use architectures that generalize well by reducing architecture complexity
(5) Add more data samples
(6) Add a BN (batch normalization) layer, because batch normalization regularizes the model. Regularization reduces the over-fitting problem and leads to better test performance through better generalization (see the sketch after this list).
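A minimal Keras sketch combining several items from the list above (a deliberately small architecture, weight regularization, a batch-normalization layer, and a dropout layer); the 20-feature toy input, layer sizes, and dropout rate are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # (3) regularization
    tf.keras.layers.BatchNormalization(),   # (6) batch normalization
    tf.keras.layers.Dropout(0.3),           # (1) dropout layer
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data standing in for (5) "more data samples"; (2) augmentation would be applied here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

history = model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32, verbose=0)
print("final validation accuracy:", round(history.history["val_accuracy"][-1], 3))
```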
Overfitting is simply the direct consequence of treating the statistical parameters, and therefore the results obtained, as useful information without checking that they could not have been obtained in a random way. Therefore, in order to estimate the presence of overfitting, we have to apply the algorithm to a database equivalent to the real one but with randomly generated values; repeating this operation many times, we can estimate the probability of obtaining equal or better results by chance. If this probability is high, we are most likely in an overfitting situation. For example, the probability that a fourth-degree polynomial has a correlation of 1 with 5 random points on a plane is 100%, so this correlation is useless and indicates an overfitting situation.
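A hedged sketch of that randomization check using scikit-learn's permutation_test_score, which refits the same model on label-shuffled data many times and reports how often a result as good as the real one arises by chance; the logistic-regression model and synthetic data are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=5000), X, y, cv=5, n_permutations=100, random_state=0)

print(f"real cross-validated accuracy: {score:.3f}")
print(f"mean accuracy on label-shuffled data: {perm_scores.mean():.3f}")
print(f"p-value (chance of doing this well at random): {p_value:.3f}")

# A high p-value means the apparent performance could easily arise by chance --
# the overfitting situation described above.
```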