Which books about the analysis of psychometric scores and questionnaires have you read so far?
Generally:
The Likert scale is not numeric. It is ordinal. The digits (like, for instance, 0, 1, ... 5) are only labels for the categories. They are not values. You could just as well use other labels like a, b, ... e, or black, dark gray, ... white.
Now if you can explain how you can sensibly calculate a sum or a mean of the labels {a,b,a,c,c,d,b,a}, then a t-test would at least make some sense.
Analyzing the ranks of the values is possible, but that is different from what a t-test does. This is what "non-parametric tests" do, but it addresses a substantively different statistical hypothesis.
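To make this concrete, here is a minimal sketch in Python/scipy (the answers and the two codings are invented for illustration): the rank-based test uses only the ordering and therefore gives the same p-value under any order-preserving coding, whereas the t-test needs numbers, and its result changes with whatever numbers one decides to assign to the categories.

```python
import numpy as np
from scipy import stats

# Hypothetical Likert answers from two groups (categories a < b < c < d < e,
# stored here as 0..4 purely as labels).
group1 = np.array([0, 1, 1, 2, 2, 2, 3, 3, 4, 4])
group2 = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

# Two different but equally "admissible" numeric codings of the same categories.
codings = {
    "equally spaced (0,1,2,3,4)":    {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
    "unequally spaced (0,1,2,6,20)": {0: 0, 1: 1, 2: 2, 3: 6, 4: 20},
}

for name, m in codings.items():
    x = np.vectorize(m.get)(group1)
    y = np.vectorize(m.get)(group2)
    # The rank-based test only uses the ordering, so its p-value is identical
    # under both codings; the t-test result depends on the chosen numbers.
    print(name)
    print("  Mann-Whitney p:", stats.mannwhitneyu(x, y, alternative="two-sided").pvalue)
    print("  t-test       p:", stats.ttest_ind(x, y).pvalue)
```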
You should be a bit clearer about what statistical hypothesis you want to test (and why you want to test it).
At least 90% of what is written in papers is "wrong". Papers are not a good source to learn and understand how things should be done. One can look into papers to get ideas, but one should not take the published methods as adequate just because they have been published. The important work is to really understand a method. This is harder work than simply looking at how others did it and repeating it. Unfortunately, our current system does not encourage this. On the contrary, it rewards publications that simply repeat the methods that have already been published, even when it is clear that these methods are inadequate or even flawed. Readers won't invest the time to learn and understand other methods, reviewers feel uncomfortable seeing manuscripts that question the correctness of methods they themselves have used before, and editors are scared of putting readers off.
Curiously, the t-test has performed well in simulation studies on Likert-type items. There is a lot of folklore about when you can and cannot use it, but the test has turned out to be a lot more rugged and versatile than the theoreticians predicted.
See: http://bcss1.blogspot.ie/2015/02/myths-and-nonsense-about-t-test.html
I am concerned with some points in the document you cited:
"Simulation studies comparing the t-test and the Wilcoxon Mann-Whitney test on items scored on 5-point scales have given heartening results. In most scenarios, the two tests had a similar power to detect differences between groups."
This makes no sense (to me). The concept of power is related to an effect size (more specifically: to a difference between two hypotheses within a statistical model). I wonder how this can be compared between the t-test and the Wilcoxon test? The two tests use different statistical models, and the effects are different in nature.
Where is the law saying that the categories have to be coded with equally spaced numerical values? Any other mapping from the scores to a numerical variable is admissible, but the comparability of these simulations will depend on this mapping, so the "general" result that is proposed is as arbitrary as the mapping.
However, there might well exist a reasonable mapping from scores to numerical values. If such a mapping exists or can be argued for, then one has sensible (numerical) values and a quantitative (parametric?) analysis does make sense.
Putting "power" aside and just looking at the p-values, it might be that the same sets of data give a similar distribution of p-values. But since these p-values refer to different statistical models and hypotheses, a particular p-value of one test might not have the same meaning as the same p-value from the other test. So one may attribute a different "significance" to each. I don't see how they should be comparable - unless one agrees that the mapping is sensible.
If you really don't care about the mapping, then it seems that simply the "t-test on ranks" is being compared to the Wilcoxon test. These two are in fact very similar (and asymptotically equivalent). But it still makes a substantial difference whether I want to look at ranks or at expectations.
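A minimal sketch of this equivalence (the scores below are invented; with real data one would use the observed answers): rank the pooled data, run an ordinary t-test on the per-group ranks, and compare the p-value with that of the Wilcoxon/Mann-Whitney test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.integers(0, 5, size=40)   # group A: 5-point scores 0..4 (made up)
y = rng.integers(1, 6, size=40)   # group B: scores 1..5, i.e. slightly shifted

# Rank the pooled data (mid-ranks for ties), then t-test the ranks per group.
ranks = stats.rankdata(np.concatenate([x, y]))
rx, ry = ranks[:len(x)], ranks[len(x):]

print("t-test on ranks p =", stats.ttest_ind(rx, ry).pvalue)
print("Mann-Whitney    p =", stats.mannwhitneyu(x, y, alternative="two-sided").pvalue)
```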
Finally, I don't see that BCSS address the "boundary problem", i.e. what happens when the distribution of the answers is L-, J- or U-shaped, which is a possible (if not likely) situation in questionnaires. I am afraid that a standard application of tests (be it the t-test or the Wilcoxon test) will just ignore this, that researchers don't even look at such things, believing that the test is "robust" anyway, and thereby possibly miss the most important insights they could get from their data.
It might seem trivial (and not the best example), but it illustrates my point:
Suppose you actually want to measure the concentration of some substance in a solution.
The color intensity of the solution depends on the concentration (like a "strong coffee" is darker than a "light coffee").
You ask people to rate the concentrations on a Likert scale: "very low", "low", "so-so", "high", "very high" (concentration) by judging the color of some solutions.
Now the crucial point: the color intensity is not a linear function of the concentration. In my example, consider a logarithmic relation.
Now the ranks contain considerably less (and different!) information than the actual concentrations.
I attached a picture of a simulation, once using a logarithmic relationship between color (-> score) and concentration and once a linear relationship. The plots show the sorted p-values. The sample size was 20, the standardized effect was 2 (either on the log scale or on the linear scale). The "t-test on scores" (red circles) is added to show the equivalence to the Wilcoxon test. Both plots show that - if the quantitative information is available! - the t-test gives systematically smaller p-values, since some information gets lost by the discretization into a few categories. This difference is extreme for the logarithmic relationship, because not only is the information used by the scores more "coarse" - it also misses (or mis-maps) the underlying relationship.
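For reference, here is a rough sketch of the kind of simulation I mean. It is not the exact code behind the attached picture; the distributional choices, the shift used inside the logarithm and the equal-width binning into 5 categories are mine. Plotting the sorted p-values collected by such a run gives plots of the kind attached.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, nsim = 20, 2.0, 1000   # per-group sample size, standardized effect, runs

def simulate(relation):
    """Return sorted p-values of: t-test on the concentrations,
    Wilcoxon test on the 5-point scores, t-test on the same scores."""
    p_t_conc, p_w_score, p_t_score = [], [], []
    for _ in range(nsim):
        a = rng.normal(0.0, 1.0, n)      # concentrations, group A
        b = rng.normal(effect, 1.0, n)   # concentrations, group B (shifted by 2 SD)
        conc = np.concatenate([a, b])
        # color intensity: linear in the concentration, or a (shifted) logarithm of it
        color = conc if relation == "linear" else np.log(conc - conc.min() + 1.0)
        # raters turn the color into 5 ordered categories (equal-width color bands)
        scores = np.digitize(color, np.linspace(color.min(), color.max(), 6)[1:-1])
        sa, sb = scores[:n], scores[n:]
        p_t_conc.append(stats.ttest_ind(a, b).pvalue)
        p_w_score.append(stats.mannwhitneyu(sa, sb, alternative="two-sided").pvalue)
        p_t_score.append(stats.ttest_ind(sa, sb).pvalue)
    return map(np.sort, (p_t_conc, p_w_score, p_t_score))

for relation in ("linear", "log"):
    pt, pw, pts = simulate(relation)
    print(relation,
          "| median p, t-test on concentrations:", round(float(np.median(pt)), 4),
          "| Wilcoxon on scores:", round(float(np.median(pw)), 4),
          "| t-test on scores:", round(float(np.median(pts)), 4))
```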
Maybe I did not understand your example correctly, but doesn't it address a different problem? Your example shows that a rank-based coding, compared to the real values, leads to a loss of information and hence to less sensitive tests (Wilcoxon and t-test on ranks/scores compared to a t-test on the values), especially when the relationship is non-linear. But this questions the use of rank-based assessments in the first place, not the choice of the test, doesn't it? As your example shows, IF your data has already been gathered as ranked information (for example on a Likert scale), the p-values from the t-test and the Wilcoxon test do not differ (red and green dots).
Dear Jochen - I would recommend you have a look at the paper. Since it's a simulation study, you can set up distributions and differences between groups, and to my mind the authors examined a range of plausible scenarios. But I'd value your comments.
I might not have made clear enough the difference between scores, values for the scores, and ranks.
The scores are (ordered) categorical values. A t-test does not work on scores. To apply a t-test you need a numerical coding of the categories.
The Wilcoxon test effectively uses the ranks and thus tests (at least asymptotically) the expected rank difference (more precisely, the stochastic equivalence of the ranks).
If you manually map the categories to numerical values that are ranks (or some linear transformation of the ranks), the t-test does just the same; it is an approximate test (asymptotically exact) of the expected rank difference.
But why on earth should the categories be coded by their ranks?
If you use a different coding, the t-test and the Wilcoxon test do something very different. The Wilcoxon test will still be about the ranks, but the t-test will be about the expected difference of these mapped values. This is relevant for the interpretation of effect sizes and the meaning of the p-values.
I read the blog post of BCSS you attached. I can't see the logic in that post. The authors do not say a single word about why the ranks are used for the t-test and not some other numerical values. And then they write:
[...] With small samples and odd-shaped distributions, it's wise to cross-check by running a Wilcoxon Mann-Whitney test, but if they disagree, remember that they test different hypotheses: the t-test tests for differences in means, while the Wilcoxon Mann-Whitney tests the hypotheses that a person in one group will score higher than a person in the other group. There can be reasons why one is significant and the other isn't.
Now what? Does it matter what hypothesis one is testing, or not? They seem to say (as I understand it) that it does not matter as long as the p-values are in line. I cannot subscribe to this. Firstly, I think it is very relevant what hypothesis one wants to test (and whether this hypothesis is sensible). Secondly, the tested hypothesis must be clear in order to pin down the effect size and to judge the p-value (or, in the case of a formal hypothesis test, to state the expected effect beforehand, give the utility function, and choose reasonable values for alpha and beta).
This is especially important when the sample size gets larger, because the p-values become smaller and no longer indicate a strong effect.
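A minimal illustration of this point (with invented normal data, not Likert scores): the standardized effect is fixed at a tiny 0.02, yet the p-value becomes very small once the sample is large enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (50, 5_000, 500_000):
    x = rng.normal(0.00, 1.0, n)   # group A
    y = rng.normal(0.02, 1.0, n)   # group B: standardized effect of only 0.02
    # with a huge n even this tiny effect yields a very small p-value
    print(f"n = {n:>7}  p = {stats.ttest_ind(x, y).pvalue:.3g}")
```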
Having two tests that give similar p-values can simply be a sign that both tests address the same nonsensical hypothesis. Let me give a (possibly too) simple example:
You ask two groups (say, two sports clubs) for their monthly contributions to their club. You use a Likert-type scale with categories in steps of 100 $. The analysis of the ranks will test the stochastic difference between the ranks of the contributions; the null hypothesis is that the probability that a member of group B contributes more than a member of group A is 0.5.
A t-test on the actual contributions would test the hypothesis that the expected contribution per member is the same in both clubs.
If it is possible (I haven't checked that!) that the rank test indicates a stochastic inequality while the expected contributions are the same, then it matters a lot which hypothesis you test (it may depend on whether you look at it from a social or an economic perspective).
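Here is a minimal sketch of that possibility (the contribution distributions are invented, not data from any real clubs): both clubs have the same expected contribution, but the distributions have different shapes, so the ranks are stochastically unequal and the two tests can come to different conclusions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 200
club_a = rng.exponential(scale=100.0, size=n)        # skewed contributions, mean 100 $
club_b = rng.normal(loc=100.0, scale=10.0, size=n)   # symmetric contributions, mean 100 $

# Equal expected contributions, but P(a member of A pays more than a member of B)
# is about exp(-1) ~ 0.37, i.e. clearly not 0.5.
print("means:", round(club_a.mean(), 1), round(club_b.mean(), 1))
print("t-test (Welch) p =", stats.ttest_ind(club_a, club_b, equal_var=False).pvalue)
print("Mann-Whitney   p =", stats.mannwhitneyu(club_a, club_b, alternative="two-sided").pvalue)
```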
It may well be that I really and deeply misunderstand the whole thing, so I am really grateful for any advice/correction/comment!