When I used the McNemar test, it produced values like 250. The accuracy difference between the two algorithms was 5%. Are the values produced by the McNemar test correct?
Also look at the paper by Demsar, "Statistical Comparisons of Classifiers over Multiple Data Sets", for a discussion of statistical tests as applied to ML.
If I understand correctly how you are applying the statistic, your result would indicate that one classifier is statistically not equivalent to the other, i.e. you would expect one to be consistently better than the other. As I understand it, the magnitude of the statistic is not necessarily an indicator of the size of the improvement, only of the likelihood that the change is real.
Recommending a comparison metric is difficult without knowing more about your problem, particularly the relative misclassification costs and the prevalence of your classes. Generally it is possible to create a performance metric specific to your problem to find the "best" classifier, and then use McNemar's test to determine whether the difference is significant.
I am using only one data set, with classification accuracy as the performance measure. My work was to combine multiple classifiers to produce better classification accuracy than the single best classifier considered. Here I need to perform a significance test to say that the combination of multiple classifiers has produced a statistically significant improvement over the single best classifier. I obtained a 2-4% accuracy improvement over the single best classifier, but when I performed the McNemar test it produced values around 250. So I am confused about whether the metric I used is correct.
You've shown that the difference is "significant" rather than an artifact of random behavior, but not that it is a meaningful improvement: you haven't shown the value of the improvement, or whether it is important or worth pursuing.
If performance is your only measure of importance in this exercise, I'd suggest that you employ equal sampling of each class and N-fold cross-validation, to show that the confidence in your performance measurement is high enough that the improvement will generalize, as well as to provide statistical significance.
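As a rough illustration of that setup, here is a minimal scikit-learn sketch; the classifiers and the synthetic data below are placeholders, not your actual models:

```python
# Minimal sketch: stratified N-fold cross-validation with two placeholder classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data with balanced classes; substitute your own data set.
X, y = make_classification(n_samples=2000, weights=[0.5, 0.5], random_state=0)
clf_single = LogisticRegression(max_iter=1000)        # stands in for the single best classifier
clf_combined = RandomForestClassifier(random_state=0) # stands in for the combined classifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # keeps class balance per fold
acc_single = cross_val_score(clf_single, X, y, cv=cv, scoring="accuracy")
acc_combined = cross_val_score(clf_combined, X, y, cv=cv, scoring="accuracy")

print(f"single:   {acc_single.mean():.3f} +/- {acc_single.std():.3f}")
print(f"combined: {acc_combined.mean():.3f} +/- {acc_combined.std():.3f}")
```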
In classification, I have mostly used the corrected resampled t-test when doing cross-validation for evaluation, since there are dependencies between the individual runs, so the basic t-test tends to declare more results statistically significant than there actually are. The correction largely fixes that.
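For reference, a minimal sketch of the corrected resampled t-test (the Nadeau-Bengio correction), assuming you have already collected the per-fold accuracy differences between the two classifiers; the numbers below are placeholders:

```python
# Minimal sketch: corrected resampled t-test on per-fold accuracy differences.
import numpy as np
from scipy.stats import t

# Placeholder per-fold differences from repeated cross-validation runs.
diffs = np.array([0.03, 0.02, 0.04, 0.01, 0.05, 0.02, 0.03, 0.04, 0.02, 0.03])
k = len(diffs)                  # number of train/test resamples
n_test_over_n_train = 1 / 9     # e.g. 10-fold CV: each test fold is 1/9 of its training set

d_mean = diffs.mean()
d_var = diffs.var(ddof=1)
# The extra n_test/n_train term corrects for the overlap between training sets.
t_stat = d_mean / np.sqrt((1 / k + n_test_over_n_train) * d_var)
p_value = 2 * t.sf(abs(t_stat), df=k - 1)

print(f"corrected t = {t_stat:.2f}, p = {p_value:.3f}")
```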
Of course, regardless of whether the difference is 'stable' (= statistically significant), the question remains whether the absolute improvement itself is significant and offers benefits in practical applications.
Another thing that you might consider doing is characterizing the improvements by identifying the types of points for which the new algorithm offers improvements. This case-based analysis is often a good addition to any serious study.
Since there are correlations between the outputs of classifiers depending on the dataset and data point used (i.e. they probably all make errors on difficult points, or have a worse average error on harder datasets), I would actually recommend some type of bootstrap statistic. Bootstrap statistics make no assumptions about the distribution of errors and can be mechanically applied to almost any problem. It is difficult to get theoretical guarantees for them, but it is also difficult to verify whether the assumptions behind standard parametric tests hold in practice.
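As an illustration, a minimal paired-bootstrap sketch on the per-point correctness of two classifiers; the arrays below are random placeholders, not real predictions:

```python
# Minimal sketch: paired bootstrap for the accuracy difference of two classifiers.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
correct_a = rng.integers(0, 2, n)   # placeholder: 1 if classifier A got point i right
correct_b = rng.integers(0, 2, n)   # placeholder: 1 if classifier B got point i right

observed_diff = correct_b.mean() - correct_a.mean()

n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, n)     # resample test points with replacement, keeping the pairing
    boot_diffs[i] = correct_b[idx].mean() - correct_a[idx].mean()

# 95% percentile confidence interval for the accuracy difference
lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"observed diff = {observed_diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```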
The suggestions lean toward goodness-of-fit and significance tests. In other words, they say a lot about whether the difference is real, but not much about the size of the improvement.
In the question, McNemar's test is described as producing a statistic of 250, so the classifiers are significantly different. But if the confidence intervals around the sensitivity and specificity estimates, for instance, are ±2%, a 4% improvement in performance means very little. The difference needs to be statistically significant and of a magnitude that is large compared to the confidence in the estimate.
I think that to claim the combination of multiple classifiers has produced a statistically significant improvement over the single best classifier, you should calculate the sensitivity and specificity of each compared to the accurate (true) classification. Then decide which one (sensitivity or specificity) is more important for the matter you are evaluating, and compare the proportions between the two algorithms to test whether a significant improvement exists.
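As a rough sketch of that idea, here is how sensitivity and specificity with approximate confidence intervals could be computed for one classifier; the data below are placeholders, and you would repeat this for both classifiers and compare:

```python
# Minimal sketch: sensitivity and specificity with normal-approximation 95% CIs.
import numpy as np

def rate_with_ci(hits, total, z=1.96):
    p = hits / total
    half = z * np.sqrt(p * (1 - p) / total)   # normal-approximation half-width
    return p, p - half, p + half

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)   # placeholder true labels
y_pred = rng.integers(0, 2, 500)   # placeholder predictions of one classifier

pos, neg = y_true == 1, y_true == 0
sens, s_lo, s_hi = rate_with_ci(int(np.sum(y_pred[pos] == 1)), int(pos.sum()))
spec, c_lo, c_hi = rate_with_ci(int(np.sum(y_pred[neg] == 0)), int(neg.sum()))

print(f"sensitivity = {sens:.3f} [{s_lo:.3f}, {s_hi:.3f}]")
print(f"specificity = {spec:.3f} [{c_lo:.3f}, {c_hi:.3f}]")
# A 4% gap between two classifiers means little if these intervals are ±2% or wider.
```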
I do not think so, if the difference was only 5%. It would be good to know what the sample size is. If you want, send the 2x2 table and I will try to review your calculations.
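For illustration, a minimal sketch of how the McNemar statistic follows from the two discordant cells of the 2x2 table; the counts below are hypothetical, not the asker's, and they show that on a large test set a 5% accuracy gap can indeed give a statistic in the hundreds, which is why the sample size matters:

```python
# Minimal sketch: McNemar's test from a hypothetical 2x2 disagreement table.
from scipy.stats import chi2

# b = cases classifier A got right and classifier B got wrong
# c = cases classifier B got right and classifier A got wrong
# Hypothetical counts on ~10,000 test cases (accuracy gap = (c - b)/10000 = 5%).
b, c = 100, 600

stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-square statistic with continuity correction
p_value = chi2.sf(stat, df=1)

print(f"McNemar statistic = {stat:.1f}, p = {p_value:.3g}")
```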
In comparing the classification performance of two classifiers, the average over a set of runs, along with the standard deviation, is usually reported in the literature. However, tests based on p-values have also been used to compare the performance of classification algorithms.
See the following article by Del Jeus et al. on comparing classification performance using p-values:
I have used the average, standard deviation, minimum, and maximum classification rates to compare my algorithm with others on different data sets, as done by Baykasoglu and Osbakir in their article "MEPAR-miner: Multi-expression programming for classification rule mining", whose link is:
The 5 by 2 cross-validated t-test is a well-known significance test for comparing the performance of two classifiers. It was proposed by Dietterich in 1998, and I think you could try it. In addition, the 5 by 2 cross-validated F-test, given by Alpaydin in 1999, is a better choice.
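A minimal sketch of Dietterich's 5 by 2 cross-validated paired t-test, assuming you have already collected the per-fold differences in error rate between the two classifiers; the 5x2 array below is a placeholder:

```python
# Minimal sketch: 5x2cv paired t-test on per-fold error-rate differences.
import numpy as np
from scipy.stats import t

# One row per 2-fold CV replication, one column per fold (placeholder values).
p = np.array([[0.03, 0.02],
              [0.04, 0.01],
              [0.02, 0.03],
              [0.05, 0.02],
              [0.03, 0.04]])

p_bar = p.mean(axis=1)                                   # mean difference per replication
s2 = (p[:, 0] - p_bar) ** 2 + (p[:, 1] - p_bar) ** 2     # variance estimate per replication
t_stat = p[0, 0] / np.sqrt(s2.mean())                    # numerator: first fold of replication 1
p_value = 2 * t.sf(abs(t_stat), df=5)

print(f"5x2cv t = {t_stat:.2f}, p = {p_value:.3f}")
```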
You can test the statistical significance of the difference in the performance of two classifiers by computing their error rates and then using a t-test on the difference of the error rates. An example of such a comparison is provided in Section 4.6 of this book:
"Introduction to Data Mining", by Tan,Steinbach, and Kumar