Cross validation is a method applied to a model and a data set to estimate the out-of-sample error. It has become quite popular because of its simplicity and utility; there is even a popular statistics message board named after the method (Cross Validated: stats.stackexchange.com).
When we fit a model to a data set, we do so by minimizing some sort of loss function; most often, we will use the squared error loss function for simplicity. It is well known, and should be quite obvious, that estimating the resulting prediction error using the same data that we used to fit the model will produce overly optimistic results. Therefore, it is common practice to test the model on a new data set to provide a better estimate of the out-of-sample prediction error. However, when data collection is cost prohibitive, we may prefer not to "throw away" a significant portion of our data in a test set. In this case we may turn to k-fold cross validation, the most popular flavor of which is 10-fold cross validation.

In k-fold cross validation, the data set is split randomly into k partitions. We then fit our model to a data set consisting of k-1 of the original k parts and use the remaining part for validation. That is, we estimate the out-of-sample error using the portion of the data left out of the fitting procedure. We repeat this k times, and our estimate of the out-of-sample error is the average over the k validation runs.
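The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `k_fold_cv_mse` and the ordinary-least-squares model used in the example are my own assumptions, chosen only to make the sketch runnable.

```python
import numpy as np

def k_fold_cv_mse(X, y, fit, predict, k=10, seed=0):
    """Estimate out-of-sample mean squared error by k-fold cross validation.

    `fit(X_train, y_train)` returns a fitted model; `predict(model, X_test)`
    returns predictions. Both are supplied by the caller.
    """
    # Split the row indices randomly into k (nearly equal) partitions.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errors = []
    for i in range(k):
        test = folds[i]                                       # held-out part
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])                       # fit on k-1 parts
        pred = predict(model, X[test])                        # validate on the rest
        errors.append(np.mean((y[test] - pred) ** 2))         # fold-level MSE
    return np.mean(errors)                                    # average over k runs

# Illustrative use with ordinary least squares on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=100)
mse = k_fold_cv_mse(
    X, y,
    fit=lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0],
    predict=lambda beta, Xt: Xt @ beta,
)
```

Because the true noise here has variance 0.01, the cross-validated MSE estimate should come out close to that value.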
There are many excellent references available. The following book by Hastie et al. is quite good and freely available online; it has a short section on cross validation that may be of interest.
Hastie, T., & Tibshirani, R. (2005). The elements of statistical learning: Data mining, inference and prediction (2nd ed.). Springer. Retrieved from http://www.springerlink.com/index/D7X7KX6772HQ2135.pdf
Your answer is really informative, but I still have a doubt. I understand the reason for k-fold or 10-fold cross validation. What is the need for 10-times 10-fold cross validation, and what is the actual procedure? Is it simply running the 10-fold cross validation method 10 times? That would mean the loop runs 100 times in total, and finally we take the average of all 100 outputs?
10-fold cross validation performs the fitting procedure a total of ten times, with each fit performed on a training set consisting of 90% of the total data set selected at random, and the remaining 10% used as a hold-out set for validation.
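On the "10-times" part of the question: the usual reading of 10-times 10-fold cross validation is exactly the one suggested above, i.e. repeat the whole 10-fold procedure 10 times, each time with a fresh random partition of the data, and average all 100 fold estimates. Repeating with different partitions reduces the variance that comes from any one particular random split. A minimal self-contained sketch (the function name and the ordinary-least-squares model are illustrative assumptions, not anything from this thread):

```python
import numpy as np

def repeated_kfold_mse(X, y, k=10, repeats=10, seed=0):
    """Repeat k-fold cross validation `repeats` times, each repeat using a
    fresh random partition, and average all k * repeats fold-level MSEs.
    The model is a plain least-squares fit, just to keep the sketch runnable.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repeats):
        # New random partition into k folds for each repeat.
        folds = np.array_split(rng.permutation(len(y)), k)
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
            errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    # For k=10, repeats=10 this averages 100 fold estimates in total.
    return np.mean(errors), len(errors)
```

So yes: with k=10 and 10 repeats, the model is fit 100 times, and the final estimate is the average over those 100 validation runs.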
Thanks Shane. But my doubt is still not cleared up. I understand 10-fold cross validation as you have explained it. But what is 10-times 10-fold cross validation? Is it doing 10-fold cross validation 10 times? That is, do we need to do a total of 100 (10 x 10-fold) cross validation runs?