When I used the McNemar test, it produced values like 250. The accuracy difference between the two algorithms was 5%. Are the values produced by the McNemar test correct?
Also look at the paper by Demsar, "Statistical Comparisons of Classifiers over Multiple Data Sets", for a discussion of statistical tests as applied to ML.
If I understand correctly how you are applying the statistic, your result would indicate that one classifier is statistically not equivalent to the other, i.e. you would expect one to be consistently better than the other. As I understand it, the magnitude of the statistic is not necessarily an indicator of the size of the improvement, only of the likelihood that the change is real.
Recommending a comparison metric is difficult without knowing more about your problem, particularly the relative misclassification costs and the prevalence of your classes. Generally it is possible to create a performance metric specific to your problem to find the "best" classifier, and then use McNemar's test to determine whether the difference is significant.
I am using only one data set, with classification accuracy as the performance measure. My work was to combine multiple classifiers to produce better classification accuracy than the single best classifier considered. Here I need to perform a significance test to say that the combination of multiple classifiers has produced a statistically significant improvement over the single best classifier. I obtained a 2-4% accuracy improvement over the single best classifier, but when I performed the McNemar test it produced values around 250. So I am confused about whether the metric I used is correct.
You've shown that the difference is "significant" rather than an artifact of random behavior, but not that it is a meaningful improvement: you haven't shown the value of the improvement, or whether it is important or worth pursuing.
If performance is your only measure of importance in this exercise, I'd suggest that you employ equal sampling of each class and N-fold cross-validation, to show that the confidence in your performance measurement is high enough that the improvement will generalize, as well as to provide statistical significance.
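As a rough illustration of that setup, here is a minimal scikit-learn sketch; the classifiers and the synthetic data below are placeholders, not your actual models:

```python
# Minimal sketch: stratified N-fold cross-validation with two placeholder classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data with balanced classes; substitute your own data set.
X, y = make_classification(n_samples=2000, weights=[0.5, 0.5], random_state=0)
clf_single = LogisticRegression(max_iter=1000)        # stands in for the single best classifier
clf_combined = RandomForestClassifier(random_state=0) # stands in for the combined classifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # keeps class balance per fold
acc_single = cross_val_score(clf_single, X, y, cv=cv, scoring="accuracy")
acc_combined = cross_val_score(clf_combined, X, y, cv=cv, scoring="accuracy")

print(f"single:   {acc_single.mean():.3f} +/- {acc_single.std():.3f}")
print(f"combined: {acc_combined.mean():.3f} +/- {acc_combined.std():.3f}")
```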
In classification, I have mostly used the corrected resampled t-test when doing cross-validation for evaluation, since there are dependencies between the individual runs, so the basic t-test tends to declare more results statistically significant than there actually are. The correction largely fixes that.
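For reference, a minimal sketch of the corrected resampled t-test (the Nadeau-Bengio correction), assuming you have already collected the per-fold accuracy differences between the two classifiers; the numbers below are placeholders:

```python
# Minimal sketch: corrected resampled t-test on per-fold accuracy differences.
import numpy as np
from scipy.stats import t

# Placeholder per-fold differences from repeated cross-validation runs.
diffs = np.array([0.03, 0.02, 0.04, 0.01, 0.05, 0.02, 0.03, 0.04, 0.02, 0.03])
k = len(diffs)                  # number of train/test resamples
n_test_over_n_train = 1 / 9     # e.g. 10-fold CV: each test fold is 1/9 of its training set

d_mean = diffs.mean()
d_var = diffs.var(ddof=1)
# The extra n_test/n_train term corrects for the overlap between training sets.
t_stat = d_mean / np.sqrt((1 / k + n_test_over_n_train) * d_var)
p_value = 2 * t.sf(abs(t_stat), df=k - 1)

print(f"corrected t = {t_stat:.2f}, p = {p_value:.3f}")
```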
Of course, regardless of whether the difference is 'stable' (= statistically significant), the question remains whether the absolute improvement itself is significant and offers benefits in practical applications.
Another thing that you might consider doing is characterizing the improvements by identifying the types of points for which the new algorithm offers improvements. This case-based analysis is often a good addition to any serious study.
Since there are correlations between the outputs of classifiers depending on the dataset and data point used (i.e. they probably all make errors on difficult points, or have a worse average error on harder datasets), I would actually recommend some type of bootstrap statistic. Bootstrap statistics make no assumptions about the distribution of errors and can be mechanically applied to almost any problem. It is difficult to get theoretical guarantees for them, but it is also difficult to verify whether the assumptions behind standard parametric tests hold in practice.
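As an illustration, a minimal paired-bootstrap sketch on the per-point correctness of two classifiers; the arrays below are random placeholders, not real predictions:

```python
# Minimal sketch: paired bootstrap for the accuracy difference of two classifiers.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
correct_a = rng.integers(0, 2, n)   # placeholder: 1 if classifier A got point i right
correct_b = rng.integers(0, 2, n)   # placeholder: 1 if classifier B got point i right

observed_diff = correct_b.mean() - correct_a.mean()

n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, n)     # resample test points with replacement, keeping the pairing
    boot_diffs[i] = correct_b[idx].mean() - correct_a[idx].mean()

# 95% percentile confidence interval for the accuracy difference
lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"observed diff = {observed_diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```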
The suggestions lean toward goodness-of-fit and significance tests. In other words, they say a lot about whether the difference is real, but not much about the size of the improvement.
In the question, McNemar's test is described as producing a statistic of 250, so the classifiers are significantly different. But if the confidence intervals around the sensitivity and specificity estimates, for instance, are ±2%, a 4% improvement in performance means very little. The difference needs to be statistically significant and of a magnitude that is large compared to the confidence in the estimate.
I think that to claim the combination of multiple classifiers has produced a statistically significant improvement over the single best classifier, you should calculate the sensitivity and specificity of each compared to the accurate (true) classification. Then decide which one (sensitivity or specificity) is more important for the matter you are evaluating, and compare the proportions between the two algorithms to test whether a significant improvement exists.
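As a rough sketch of that idea, here is how sensitivity and specificity with approximate confidence intervals could be computed for one classifier; the data below are placeholders, and you would repeat this for both classifiers and compare:

```python
# Minimal sketch: sensitivity and specificity with normal-approximation 95% CIs.
import numpy as np

def rate_with_ci(hits, total, z=1.96):
    p = hits / total
    half = z * np.sqrt(p * (1 - p) / total)   # normal-approximation half-width
    return p, p - half, p + half

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)   # placeholder true labels
y_pred = rng.integers(0, 2, 500)   # placeholder predictions of one classifier

pos, neg = y_true == 1, y_true == 0
sens, s_lo, s_hi = rate_with_ci(int(np.sum(y_pred[pos] == 1)), int(pos.sum()))
spec, c_lo, c_hi = rate_with_ci(int(np.sum(y_pred[neg] == 0)), int(neg.sum()))

print(f"sensitivity = {sens:.3f} [{s_lo:.3f}, {s_hi:.3f}]")
print(f"specificity = {spec:.3f} [{c_lo:.3f}, {c_hi:.3f}]")
# A 4% gap between two classifiers means little if these intervals are ±2% or wider.
```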
I do not think so, if the difference was only 5%. It would be good to know what the sample size is. If you want, send the 2x2 table and I will try to review your calculations.
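For illustration, a minimal sketch of how the McNemar statistic follows from the two discordant cells of the 2x2 table; the counts below are hypothetical, not the asker's, and they show that on a large test set a 5% accuracy gap can indeed give a statistic in the hundreds, which is why the sample size matters:

```python
# Minimal sketch: McNemar's test from a hypothetical 2x2 disagreement table.
from scipy.stats import chi2

# b = cases classifier A got right and classifier B got wrong
# c = cases classifier B got right and classifier A got wrong
# Hypothetical counts on ~10,000 test cases (accuracy gap = (c - b)/10000 = 5%).
b, c = 100, 600

stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-square statistic with continuity correction
p_value = chi2.sf(stat, df=1)

print(f"McNemar statistic = {stat:.1f}, p = {p_value:.3g}")
```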
In comparing the classification performance of two classifiers, the average over a set of runs, along with the standard deviation, is usually reported in the literature. However, tests based on p-values have also been used to compare the performance of classification algorithms.
See the following article by Del Jeus et al. on comparing classification performance using p-values:
I have used the average, standard deviation, minimum, and maximum classification rates to compare my algorithm with others on different data sets, as done by Baykasoglu and Osbakir in their article "MEPAR-miner: Multi-expression programming for classification rule mining", whose link is:
The 5 by 2 cross-validated t-test is a well-known significance test for comparing the performance of two classifiers. It was proposed by Dietterich in 1998, and I think you could try it. In addition, the 5 by 2 cross-validated F-test, given by Alpaydin in 1999, is a better choice.
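A minimal sketch of Dietterich's 5 by 2 cross-validated paired t-test, assuming you have already collected the per-fold differences in error rate between the two classifiers; the 5x2 array below is a placeholder:

```python
# Minimal sketch: 5x2cv paired t-test on per-fold error-rate differences.
import numpy as np
from scipy.stats import t

# One row per 2-fold CV replication, one column per fold (placeholder values).
p = np.array([[0.03, 0.02],
              [0.04, 0.01],
              [0.02, 0.03],
              [0.05, 0.02],
              [0.03, 0.04]])

p_bar = p.mean(axis=1)                                   # mean difference per replication
s2 = (p[:, 0] - p_bar) ** 2 + (p[:, 1] - p_bar) ** 2     # variance estimate per replication
t_stat = p[0, 0] / np.sqrt(s2.mean())                    # numerator: first fold of replication 1
p_value = 2 * t.sf(abs(t_stat), df=5)

print(f"5x2cv t = {t_stat:.2f}, p = {p_value:.3f}")
```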
You can test the statistical significance of the difference in the performance of two classifiers by computing their error rates and then using a t-test on the difference of the error rates. An example of such a comparison is provided in Section 4.6 of this book:
"Introduction to Data Mining", by Tan,Steinbach, and Kumar