McNemar's test appears to be cropping up more frequently these days and I am not convinced it is often used correctly, so I would like to understand what fellow researchers understand it to be telling us. Many times I read that it is a test to compare accuracy between methods.
Calculation
The test is based on a 2x2 contingency table. We have two binary methods A and B that each give a result of 1 (negative) or 2 (positive), and we summarise how often each combination of assignments occurs.
So the table's four entries are frequency counts (N) of how often we see A1B1 (both negative), A1B2 (disagree), A2B1 (disagree) and A2B2 (both positive). From here on I'll drop the A and B, always giving A's result first and B's second, so N12 is the count for A1B2.
McNemar's test is a simple calculation: $\chi^2 = (N_{12} - N_{21})^2 / (N_{12} + N_{21})$
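To make the calculation concrete, here is a minimal Python sketch of the statistic as written above, without any continuity correction. The function names and the ten paired results are made up for illustration. The p-value step assumes the usual reference distribution, chi-squared with 1 degree of freedom, for which the upper tail probability equals erfc(sqrt(x/2)).

```python
import math
from collections import Counter


def contingency_counts(a, b):
    """Count paired results (A first, then B) for the labels 1 and 2."""
    counts = Counter(zip(a, b))
    return {(i, j): counts.get((i, j), 0) for i in (1, 2) for j in (1, 2)}


def mcnemar_statistic(n12, n21):
    """(N12 - N21)^2 / (N12 + N21), with no continuity correction."""
    if n12 + n21 == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    return (n12 - n21) ** 2 / (n12 + n21)


def chi2_pvalue_1df(stat):
    """Upper tail of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical paired results from methods A and B (1 = negative, 2 = positive).
a = [1, 1, 1, 2, 2, 2, 1, 2, 1, 1]
b = [1, 2, 2, 2, 1, 2, 2, 2, 2, 1]

table = contingency_counts(a, b)
stat = mcnemar_statistic(table[(1, 2)], table[(2, 1)])
p_value = chi2_pvalue_1df(stat)
print(table, stat, p_value)
```

Note that N11 and N22 are counted but never used: only the two disagreement cells enter the statistic.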
Interpretation
N12 represents the number of samples classified by method A as belonging to class 1 that are classified as class 2 by method B - completely regardless of whether true or not. N21 is the number of samples where method A gives class 2 while method B gives class 1, again irrespective of truth or not.
When two methods are compared this way, the truth never enters the calculation, so the test does not provide any insight into accuracy.
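A small sketch of this point, using made-up data: the same pair of result vectors for A and B is scored against two hypothetical ground truths, one in which A is always right and one in which B is always right. The helper names are my own. The statistic cannot change between the two scenarios because the truth is not an input to it.

```python
def discordant_counts(a, b):
    """Return (N12, N21): A says 1 and B says 2, then A says 2 and B says 1."""
    n12 = sum(1 for x, y in zip(a, b) if (x, y) == (1, 2))
    n21 = sum(1 for x, y in zip(a, b) if (x, y) == (2, 1))
    return n12, n21


def mcnemar_statistic(n12, n21):
    """(N12 - N21)^2 / (N12 + N21), with no continuity correction."""
    return (n12 - n21) ** 2 / (n12 + n21)


def accuracy(predicted, truth):
    """Fraction of samples where the prediction matches the truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical results from the two methods (1 = negative, 2 = positive).
a = [1, 1, 1, 1, 2, 2, 2, 2]
b = [2, 2, 2, 1, 2, 2, 2, 1]

# Two hypothetical ground truths: A is always right, or B is always right.
truth_if_a_right = list(a)
truth_if_b_right = list(b)

n12, n21 = discordant_counts(a, b)
stat = mcnemar_statistic(n12, n21)  # depends on a and b only

acc_a_1, acc_b_1 = accuracy(a, truth_if_a_right), accuracy(b, truth_if_a_right)
acc_a_2, acc_b_2 = accuracy(a, truth_if_b_right), accuracy(b, truth_if_b_right)
print(stat, (acc_a_1, acc_b_1), (acc_a_2, acc_b_2))
```

The accuracies of A and B swap between the two scenarios, while the statistic computed from the A-versus-B table stays the same.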
In practice it is possible to use McNemar's test against ground truth (for the following example, method A is the ground truth). In this case it is still not measuring 'accuracy', as neither the number of correct predictions nor the total number of samples enters the calculation. N12 is the number of false positives and N21 the number of false negatives.
In this case it is measuring how skewed the test is between false positives and false negatives. I haven't attempted to work out the theoretical implications of this, because the traditional positive/negative summary statistics (sensitivity, specificity, PPV, NPV, PDLR, NDLR) are much easier to interpret when dealing with false positives and false negatives, and are therefore more useful.
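As a rough illustration of the ground-truth case, the sketch below compares two invented confusion tables that share the same false positive and false negative counts but differ in their true positives and true negatives. The counts and the `summary` helper are hypothetical, chosen only to show which quantities react to the correct predictions.

```python
def mcnemar_statistic(n12, n21):
    """(N12 - N21)^2 / (N12 + N21), with no continuity correction."""
    return (n12 - n21) ** 2 / (n12 + n21)


def summary(tn, fp, fn, tp):
    """Summary statistics for a method B scored against ground truth A."""
    total = tn + fp + fn + tp
    return {
        # With A as ground truth, N12 = false positives, N21 = false negatives.
        "mcnemar": mcnemar_statistic(fp, fn),
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Two hypothetical tests with identical FP and FN counts.
good_test = summary(tn=40, fp=10, fn=5, tp=45)
poor_test = summary(tn=5, fp=10, fn=5, tp=5)

for name, result in (("good", good_test), ("poor", poor_test)):
    print(name, result)
```

Both tables give the same McNemar statistic, while accuracy, sensitivity and specificity differ substantially between them.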
And if Cochran's Q test is used for multiple comparisons, including ground truth seems unhelpful to me, as the resulting statistic is then confounded between ground-truth comparisons and inter-method comparisons.