What is McNemar's Test Telling Us?

21 September 2017

McNemar's test appears to be cropping up more frequently these days and I am not convinced it is often used correctly, so I would like to understand what fellow researchers understand it to be telling us. Many times I read that it is a test to compare accuracy between methods.

Calculation

The test is based on a 2x2 contingency table. We have two binary methods, A and B, that each give a result of either 1 (negative) or 2 (positive), and we summarise how often the two methods give each combination of assignments.

So the table's four entries are frequency counts (N) of the number of times we get the result A1B1 (both negative), A1B2 (disagree), A2B1 (disagree) and A2B2 (both positive). I'll drop the AB from here on, but it is always A first, then B.

McNemar's test is a simple calculation: Chi-squared = (N12 - N21)^2/(N12 + N21)

Interpretation
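As a minimal sketch of the calculation above (the function name and the example counts are made up for illustration), the statistic needs only the two disagreement counts:

```python
def mcnemar_statistic(n12: int, n21: int) -> float:
    """Uncorrected McNemar statistic from the two discordant counts."""
    discordant = n12 + n21
    if discordant == 0:
        # No disagreements at all: the statistic is undefined (0/0).
        raise ValueError("statistic is undefined when N12 + N21 == 0")
    return (n12 - n21) ** 2 / discordant


# Hypothetical counts, chosen only for illustration.
print(mcnemar_statistic(15, 5))  # (15 - 5)^2 / (15 + 5) = 5.0
```

Note that N11 and N22 never enter the function, which is the point the question turns on.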

N12 represents the number of samples classified by method A as belonging to class 1 that are classified as class 2 by method B - completely regardless of whether true or not. N21 is the number of samples where method A gives class 2 while method B gives class 1, again irrespective of truth or not.

When comparing two methods, truth does not enter the table at all, so the test does not provide any insight into accuracy.

In practice it is possible to use McNemar's test against ground truth (in the following example, method A is the ground truth). In this case it is still not measuring 'accuracy', as the formula takes no input on either the correct predictions or the total number of samples. N12 is the number of false positives, N21 the number of false negatives.

In this case it is measuring how skewed the test is between false positives and false negatives. I haven't attempted to work out the theoretical implications of this, as the various traditional positive/negative summary stats (sensitivity, specificity, PPV, NPV, PDLR, NDLR) are much more readily interpreted for handling false positives and false negatives and are therefore more useful.

And if using Cochran's Q test for multiple comparisons, including ground truth would to my mind be unhelpful, as the resulting statistic is then confounded between ground-truth comparisons and inter-method comparisons.

Helene Dörksen

Hello James,

I used this test for the comparison of two classification algorithms. Based on McNemar's test results you can decide which algorithm performs (statistically) better regarding accuracies, which are used in the formula. For the value X calculated by McNemar's test, the following holds: if X is 3.84 or lower, then the algorithms have the same error (with probability ~95%). Otherwise, if X>3.84, then the performance of one algorithm is higher.

McNemar's test is used for testing generalisation ability, e.g. of a newly designed classifier. Further, it is well suited to the situation where the training set and validation set are predefined.
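For readers wondering where 3.84 comes from: it is the upper 5% point of the chi-squared distribution with one degree of freedom, the reference distribution normally used for this statistic. A small sketch using only the standard library (the function name is my own) converts a statistic into the corresponding upper-tail probability:

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    if x < 0:
        raise ValueError("the statistic cannot be negative")
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


print(round(chi2_sf_1df(3.84), 3))  # 0.05
```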

Helene Dörksen

To my knowledge the formula is: (|N12 - N21|-1)^2/(N12 + N21)
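A minimal sketch of this continuity-corrected form (function name and counts are illustrative only):

```python
def mcnemar_corrected(n12: int, n21: int) -> float:
    """McNemar statistic with the continuity correction (|N12 - N21| - 1)^2 / (N12 + N21)."""
    discordant = n12 + n21
    if discordant == 0:
        # No disagreements at all: the statistic is undefined.
        raise ValueError("statistic is undefined when N12 + N21 == 0")
    return (abs(n12 - n21) - 1) ** 2 / discordant


# Hypothetical counts, chosen only for illustration.
print(mcnemar_corrected(15, 5))  # (10 - 1)^2 / 20 = 4.05
```

Like the uncorrected form, it depends only on the two disagreement counts.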

James Renwick Beattie

Thanks Helene, you are right that another form exists; most of the reference works I have seen give the formula I stated.

For the benefit of readers, here is a reference presenting the formula in the question:

http://www.statisticssolutions.com/non-parametric-analysis-mcnemars-test/

and one for Helene's:

https://www.graphpad.com/quickcalcs/McNemarEx.cfm

Whichever form it is presented in would make no difference to the question as stated as it would have the same implications. I am not interested in this thread about relative merits of the two forms.

James Renwick Beattie

Apologies Helene, I wasn't shown your first answer initially and missed it. I may be a bit slow, but could you explain which terms in the formula capture accuracy?

Testing generalisation, I can understand that, but I think it doesn't fully explain what exactly the test is measuring. To me the test tells us whether or not two tests disagree on different samples. I think it is testing against the null hypothesis that the two tests detect positives and negatives in the same samples. When the null hypothesis is rejected, this means that there is likely to be a meaningful difference in which samples are assigned positive and negative.

It is possible for tests to have identical accuracy but still be significant in McNemar's if one is weighted towards reporting false positives and one towards false negatives.
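A small constructed example makes this concrete. The data below are entirely hypothetical: classifier A errs only by false positives, classifier B only by false negatives, and both are right on 80 of 100 samples.

```python
# Hypothetical data: 50 true negatives (class 1) followed by 50 true positives (class 2).
truth = [1] * 50 + [2] * 50
# Classifier A makes 20 false positives and no false negatives.
pred_a = [2] * 20 + [1] * 30 + [2] * 50
# Classifier B makes no false positives and 20 false negatives.
pred_b = [1] * 50 + [1] * 20 + [2] * 30


def accuracy(pred, truth):
    """Fraction of samples where the prediction matches the true class."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Discordant cells of the predicted-class table (A first, then B).
n12 = sum(a == 1 and b == 2 for a, b in zip(pred_a, pred_b))
n21 = sum(a == 2 and b == 1 for a, b in zip(pred_a, pred_b))
statistic = (n12 - n21) ** 2 / (n12 + n21)

print(accuracy(pred_a, truth), accuracy(pred_b, truth))  # 0.8 0.8
print(n12, n21, statistic)  # 0 40 40.0
```

The statistic of 40 is far above 3.84 even though the two accuracies are identical.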

Helene Dörksen

Hello James,

For comparing the performance of two classifiers (let's say A and B), the terms in the formula above are: N12, the number of examples misclassified by A but not by B; N21, the number of examples misclassified by B but not by A. The null hypothesis assumes that A and B have the same error rate, in which case N12=N21.

In my tests I used McNemar on the same sample. As I understand it, you have different samples and one single classifier. The question in your case would be: how does a classifier perform on different samples?

James Renwick Beattie

Thanks for the reply Helene. I'll address the last point first so you know the context matches your typical situation before continuing with my queries that arise from your response.

McNemar's test must always be used on paired outcomes (so performed on the same samples for classifier comparison), so I am not describing a different scenario. It would be impossible to get any values for the contingency table if the classifiers were applied to exclusive samples.

The only way I can see N12 representing the 'number of examples misclassified by A, but not by B ' as you describe is, if instead of the usual contingency table, you transform it into a truth matrix with true/false for A as columns and true/false for B. I've never encountered this approach in my reading but I can't claim to be extensively read on McNemar's test and would appreciate some useful references on this. In that case I would understand the null hypothesis would be that there is no difference in the likelihood of A being false dependent on B being correct compared with the likelihood of B being false dependent on A being correct. That still doesn't sound like an 'accuracy' test to me (it would ignore all the false predictions both tests record) but still boils down to the same issue of skewness in disagreement.
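To illustrate the transformation described above, here is a sketch that builds the correct/incorrect table from ground truth instead of the predicted-class table. The data and variable names are hypothetical, and this is only my reading of the correctness-based table, not something checked against the textbook treatment.

```python
# Hypothetical data: 50 true negatives (class 1) followed by 50 true positives (class 2).
truth = [1] * 50 + [2] * 50
# Classifier A: 20 false positives, no false negatives.
pred_a = [2] * 20 + [1] * 30 + [2] * 50
# Classifier B: no false positives, 20 false negatives.
pred_b = [1] * 50 + [1] * 20 + [2] * 30

# Correctness of each classifier on each sample, judged against ground truth.
a_ok = [a == t for a, t in zip(pred_a, truth)]
b_ok = [b == t for b, t in zip(pred_b, truth)]

# Discordant cells of the correct/incorrect table.
a_wrong_b_right = sum((not a) and b for a, b in zip(a_ok, b_ok))
b_wrong_a_right = sum(a and (not b) for a, b in zip(a_ok, b_ok))
statistic = (a_wrong_b_right - b_wrong_a_right) ** 2 / (a_wrong_b_right + b_wrong_a_right)

print(a_wrong_b_right, b_wrong_a_right, statistic)  # 20 20 0.0
```

On these data the predicted-class table would instead give N12 = 0 and N21 = 40, so the two ways of building the table can lead to very different statistics from the same predictions.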

The approach I've always seen is to use the usual contingency table of predicted class for A vs predicted class for B. When comparing two classifiers, my understanding is that the contingency table contains zero information on misclassifications. (I would never describe the ground truth as a 'classifier', so I assume you are talking about two classifiers, neither of which is considered ground truth; apologies if this is a source of confusion.) If I have misunderstood, I am afraid you will need to explain to me how I can derive that information solely from the cells N12 and N21 of the contingency table.

The elements used in cells b and c of the standard contingency table comparing two classifiers are just looking at disagreements, with no judgement on which is correct. So N12 is not indicating the number 'misclassified by A, but not by B' but, as I stated, the 'number of samples classified by method A as belonging to class 1 that are classified as class 2 by method B'.

In a situation where the differences between classifiers are completely random, one would expect on average half of N12 to be correctly classified by A and half by B; the same goes for N21. Based on N12 and N21 alone I cannot see how we can infer the balance of 'correctness' between the two tests.

I'm afraid I am not able to see how your explanations can be derived from the formula, so you'll need to be more explicit in connecting your statements to the elements of the formula and outlining your assumptions to help me grasp it.

Helene Dörksen

My explanations are based on:

Alpaydın, E.: Introduction to Machine Learning. The MIT Press, Cambridge, 2nd edn. (2010)
