I am using bootstrapping to increase the size of my data. I have read that bootstrapping is prone to overfitting, so I wanted to know whether there is any way overfitting could be avoided or minimized while using bootstrapping.
I am concerned when you say "using bootstrapping for increasing the size of my data." I cannot, at the moment, think of a proper application of a bootstrap procedure that could accurately be described that way.
Can you describe the procedure you are using in more detail?
As you know, bootstrapping is a statistical technique for resampling existing data. I am using the bootstrp() function from the Statistics Toolbox to do this. I currently have very few samples of a particular data type, and those samples are insufficient for solving my current problem (i.e., classification). So I am trying to use bootstrapping to increase the size of my data from the original data, but I have read that models trained on bootstrap-resampled data tend to overfit, especially in classification tasks. I wanted to know whether there is any way I can either avoid overfitting or keep it to a minimum. Currently I am passing @mean to bootstrp() for the resampling.
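To be concrete, here is a minimal sketch of the kind of call I am making (the data x and nboot are placeholders, not my actual values):

```matlab
% Minimal sketch of bootstrp() from the Statistics Toolbox.
% With @mean as the bootstrap function, the output is nboot replicates
% of the sample mean -- a distribution of a statistic, not a larger
% version of the original data set.
rng(1);                                 % for reproducibility
x = randn(20, 1);                       % placeholder sample
nboot = 1000;
bootMeans = bootstrp(nboot, @mean, x);  % 1000-by-1 vector of resampled means
se = std(bootMeans);                    % bootstrap estimate of the standard error
```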
I share Shane's concern over your (repeated) statement about bootstrapping increasing the size of your sample. That is not what bootstrapping does. It draws repeated samples from your existing data. I suggest you read the attached article for a better understanding of the concept.
Ariel
[Attached article: "Evaluating Disease Management Program Effectiveness: An Intr..."]
There are no methods available to magically increase your sample size; as Ariel says, that is simply not what bootstrap methods do. I am still unsure exactly what you're trying to do. From your statements, my best guess is that you're trying to fit a model to synthetic data generated from a parametric bootstrap. Such a procedure could certainly lead to overfitting the data. The best way to avoid overfitting in that case is to stop abusing the bootstrap in that manner.
I can at this point only recommend some reading material. The books by James et al. and Hastie et al. offer an excellent introduction to statistical learning/machine learning. They are also freely available online and contain examples of the proper use of bootstrap methods. However, their main focus is not bootstrap methods per se, and the relevant sections are only a (good) introduction. The third book, by Chernick, provides a good and thorough treatment of the subject; if you anticipate using bootstrap methods extensively, it would be worth reading. The book is, however, quite expensive.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
Chernick, M. R. (2007). Bootstrap Methods: A Guide for Practitioners and Researchers (2nd ed.). John Wiley & Sons.
Atish, please try to understand Shane's valuable advice about what the bootstrap is and is not. NOTHING can give you more information than is in the data (except gathering more raw data). The bootstrap is useful for measuring the intrinsic reliability of the statistical measures gleaned from that data.
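As a concrete illustration of that point, a minimal sketch (with placeholder data) of the standard, legitimate use of the bootstrap: quantifying the reliability of a statistic estimated from a fixed sample.

```matlab
% Sketch: using the bootstrap to assess reliability, not to add data.
% bootci() is in the Statistics Toolbox; x is a placeholder sample.
rng(1);
x = randn(30, 1);
ci = bootci(1000, @mean, x);   % 95% bootstrap confidence interval for the mean
```

The interval tells you how much the estimate would vary under resampling; it does not make the sample any bigger.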
Atish, I must say I completely agree with Drs McMahon, Linden and Bhavsar here. I appreciate your frustration that none of us seems to answer your question.
I am not sure if you are keen on a lecture in epistemology, but there is no mathematical method in existence that can reliably defend itself against overfitting. Math does take abuse silently, and a mathematician has to step in eventually (despite what AI courses and science fiction books may have led you to believe).
I am not sure if this will answer your question, but to avoid overfitting, I would suggest building your model on the theoretical background of your phenomenon (even if you make the theory up yourself), rather than pulling it from your data. Then you can use bootstrapping to improve the ability of your data to validate or falsify your model -- and through it, your theory.
Build enough interesting theories, so that you can select a few that agree with the data the best, and then, out of these best few, choose the simplest one. Hope this helps.
I agree with the previous comment. The bootstrap is just one approach to producing and stabilizing the standard error. The resampling is repeated so that the estimate of the standard error can actually converge.
Thanks Shane, Ariel, Suketu, Oleksiy and Wan for your prompt replies. Shane, you were right, I was trying to build a model. In my case, I have unbalanced data, and from what I read, the bootstrap helps create synthetic data, which I thought would be quite useful for balancing my data. Balanced data is very useful for training classifiers like SVM. Thank you all for your help.
It is possible that what you need is an imputation method rather than bootstrapping. It is possible to impute balanced data sets from unbalanced ones (under certain conditions). Alternatively, it may be sensible to look for an alternative to SVM that doesn't require balanced classes (I don't know enough to advise on that).
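If the goal is simply a balanced training set for the SVM, one common approach is plain random oversampling of the minority class (sampling with replacement, close to what was described above). A minimal sketch, where X, y, and minorityLabel are placeholder assumptions for illustration:

```matlab
% Minimal sketch of random oversampling for a two-class problem.
% Note: duplicated rows add no new information, so the overfitting
% concern raised earlier in this thread still applies.
rng(1);
X = randn(100, 5);                        % placeholder features (n-by-p)
y = [zeros(90, 1); ones(10, 1)];          % imbalanced placeholder labels
minorityLabel = 1;
idxMin  = find(y == minorityLabel);                  % minority-class rows
nNeeded = sum(y ~= minorityLabel) - numel(idxMin);   % rows to add
pick    = idxMin(randi(numel(idxMin), nNeeded, 1));  % sample w/ replacement
Xbal    = [X; X(pick, :)];
ybal    = [y; y(pick)];
```

An alternative worth checking is whether your SVM implementation supports class weights or misclassification costs, which sidesteps the imbalance without duplicating rows at all.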