I am getting an accuracy of 88% using naive Bayes and decision tree, but when I do k-fold cross-validation, it drops to 66%. How can I train my system more effectively?
What do you mean when you say that you get 88% accuracy using naive Bayes and decision trees? Is that accuracy computed by applying the models to the same data you used to build them? If so, it is to be expected that the accuracy drops when you use cross-validation:
a) when you test a model on the same data you used to create it, you get overfitting (https://en.wikipedia.org/wiki/Overfitting)
b) when you use cross-validation, you expect less overfitting (https://tinyurl.com/yc43ae3l) and also a more realistic proxy for the accuracy
My advice is not to treat the 88% as the benchmark for the accuracy of your models, as that number reflects overfitting. K-fold cross-validation is not decreasing your accuracy; rather, it is giving you a better approximation of that accuracy, with less overfitting. In other words, the accuracy of your models is (approximately) 66%.
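To make the point concrete, here is a minimal sketch contrasting training-set accuracy with cross-validated accuracy. It assumes scikit-learn (not named in the thread); X and y stand for your own features and labels (hypothetical names), and the Iris data is used only to keep the snippet runnable.

```python
# Sketch: accuracy on the training data vs. cross-validated accuracy.
# Assumption: scikit-learn; Iris is a stand-in for your own X, y.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Accuracy on the same data used for fitting: optimistic (overfitting).
train_acc = tree.score(X, y)

# 5-fold cross-validated accuracy: a more realistic estimate.
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(f"training accuracy:        {train_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

The gap between the two numbers is the overfitting the answer above is describing.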
If you want to improve the accuracy, focus on improving the accuracy as measured with k-fold cross-validation. There are several things you can try, for example (see the sketch after this list):
a) get more data/better data
b) try other classifiers - SVM, random forest, etc.
c) check which combinations of features work best
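A rough sketch of point (b), comparing several classifiers under the same k-fold protocol. Again this assumes scikit-learn; X and y are hypothetical names for your data, with Iris as a runnable placeholder.

```python
# Sketch: compare classifiers with the same 10-fold cross-validation protocol.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "naive bayes":   GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "svm":           SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV accuracy per model
    print(f"{name:13s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

Whichever model you pick, pick it based on these cross-validated scores, not on the optimistic single-split number.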
You can also use the hold-out (train/test split) method to check your accuracy.
The whole dataset is divided into two parts: a training set (which may include 80% of the data) and a test set containing the rest. After training your model on the training set, measure its accuracy on the test set and see how much accuracy you get.
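A minimal sketch of that hold-out method, assuming scikit-learn; X and y are hypothetical names for your features and labels (Iris used only so the snippet runs).

```python
# Sketch: hold-out (train/test split) evaluation, 80% train / 20% test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 80% training / 20% testing, stratified so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = GaussianNB().fit(X_train, y_train)
print("hold-out test accuracy:", model.score(X_test, y_test))
```

Note that this gives a single estimate that depends on which 20% happened to land in the test set, which is the issue several later answers point out.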
Hi @Miguel Patrício, I first split the data set into training and test sets, 80% and 20% respectively, and got 88% accuracy. But when I apply k-fold, my accuracy is reduced.
Rojalina Priyadarshini, I obtained the 88% by the same process you described.
Something more you can do is to rotate the split points. Suppose your data contains 100 instances: in the first iteration you select instances 1 to 80 for training and the rest for testing; in the second iteration you select instances 21 to 100 for training and the rest for testing, and so on. When calculating the accuracy, take the average over all the test sets. That should give you a more reliable accuracy percentage.
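A rough sketch of that rotating-split idea (essentially k-fold with contiguous blocks), assuming scikit-learn and NumPy; X and y are hypothetical names for your data, with Iris as a runnable stand-in.

```python
# Sketch: rotate a contiguous test block through the data and average the accuracies.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Shuffle once so the contiguous blocks are not ordered by class.
rng = np.random.RandomState(0)
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

k = 5
fold_size = len(y) // k
accuracies = []
for i in range(k):
    # The i-th contiguous block is held out for testing, the rest is training data.
    test_idx = np.arange(i * fold_size, (i + 1) * fold_size)
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)

    model = GaussianNB().fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print("per-split accuracies:", np.round(accuracies, 3))
print("average accuracy:", round(float(np.mean(accuracies)), 3))
```

This is exactly what k-fold cross-validation automates for you.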
" i have first split the data set into train and test set, 80% and 20% respectively and got 88% accuracy. but when i apply k fold, my accuracy is reduced. "
When you use k-fold with k=5, your scenario (an 80% train / 20% test split) is repeated 5 times with 5 different test sets (each time a new 20% of the data). The resulting estimate is therefore more reliable than a single split, and it may be lower or higher than the single-split number.
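A short sketch of that point, showing that 5-fold cross-validation is just the 80/20 scenario repeated with a fresh 20% held out each time. Assumes scikit-learn; X and y are hypothetical names, Iris is the placeholder data.

```python
# Sketch: k=5 cross-validation as five repeated 80/20 splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    acc = model.score(X[test_idx], y[test_idx])
    scores.append(acc)
    # Each fold trains on 80% of the data and tests on a different 20%.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, accuracy={acc:.3f}")

print(f"mean accuracy over the 5 folds: {sum(scores) / len(scores):.3f}")
```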
It is natural. Cross-validated estimates are almost always lower than estimates based on data the model has already seen, because the latter suffer from overfitting. But the size of the decrease here is quite big; if your sample size is large and the gap cannot be explained by stochastic effects, I would suggest that your classification method is overcomplicated.
There are a few reasons this could happen:
Your "manual" split is not random, and you happen to select more outliers that are hard to predict. How are you doing this split?
What is the k in your k-fold CV? I'm not sure what you mean by "validation set size": in k-fold CV you have a fold size, there is no separate validation set, and you run the cross-validation on your entire data. Are you sure you're running k-fold cross-validation correctly?
Usually, one picks k = 10 for k-fold cross-validation. If you run it correctly on your entire data, you should rely on its results rather than on the other results (a minimal sketch follows below).
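A minimal sketch of the usual k = 10 setup, run over the entire data set. Assumes scikit-learn; X and y are hypothetical names for your data, Iris is only a runnable placeholder.

```python
# Sketch: 10-fold (stratified) cross-validation over the whole data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Stratification keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)

print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```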
When you say you are getting 88%, is that 88% on your training data or your test data? The accuracy on your test data or cross-validated data is the reliable one, not the training accuracy.
I agree with Zahid and the others who have explained the k-fold cross-validation method. Note that k-fold cross-validation reduces overfitting; it does not completely eliminate it. So I would trust the results of your cross-validation over your manual split of the data.
" i have first split the data set into train and test set, 80% and 20% respectively and got 88% accuracy. but when i apply k fold, my accuracy is reduced."
This is actually not how it is supposed to work. In cross-validation, the splitting, training and testing are repeated k times and the final result is averaged; there is no separate "manual" part.
If your data set contains 100 data points, you randomly select 80 for training and 20 for testing (and repeat the process k times).
Your first (random) 80/20 split produced a very optimistic but incorrect accuracy estimate. The reduction in accuracy that you observe is exactly why k-fold cross-validation is used.
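To illustrate why a single random 80/20 split can be misleading, here is a small sketch that repeats the split many times and shows how much the estimate varies. Assumes scikit-learn; X and y are hypothetical names, Iris is the placeholder.

```python
# Sketch: repeated random 80/20 splits show the spread of single-split estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 20 independent random 80/20 splits of the same data.
splitter = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=splitter)

print(f"single-split estimates range from {scores.min():.3f} to {scores.max():.3f}; "
      f"mean = {scores.mean():.3f}")
```

Any one of those splits could, by luck, look as optimistic as your 88%.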
Usually, if your sample is sufficiently large to represent the true underlying class distribution, then you don't need to cross-validate. However, for smaller datasets it is helpful for gauging how well your algorithm would perform in a real-world application.
There's nothing wrong with your accuracy, it did not drop. You just had a bad initial estimate.
If your data set is small, you can try 10-fold cross-validation. If your data set is large, you can use 5-fold cross-validation or a 70/30 or 80/20 split. However, your question is not clear about how many folds you are using and what your data size is.
This is a somewhat strange result. Theoretically, LOO (leave-one-out) gives the best accuracy estimate, k-fold a somewhat worse one, and hold-out (your 80:20 approach) the worst. But by chance (for your particular task, partition, etc.) it may turn out the other way round. I have also sometimes found that 3-fold cross-validation compares favourably with 5-fold.
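For completeness, a small sketch comparing the three estimates mentioned above (leave-one-out, k-fold, hold-out) on the same data. Assumes scikit-learn; X and y are hypothetical names, Iris is only a runnable stand-in, and the relative ordering of the numbers will of course vary by dataset.

```python
# Sketch: LOO vs. k-fold vs. hold-out accuracy estimates for the same classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()

loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()   # leave-one-out
kfold_acc = cross_val_score(clf, X, y, cv=5).mean()             # 5-fold CV

# Single hold-out 80/20 split for comparison.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

print(f"LOO: {loo_acc:.3f}  5-fold: {kfold_acc:.3f}  hold-out: {holdout_acc:.3f}")
```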
Understand the purpose of cross-validation. If the CV accuracy is low, then either you need more data for training, you have to improve the model, or the model you chose is not good enough. One should not finalize a model based on training accuracy alone.