I am currently devising a questionnaire for an interrater agreement analysis of essays written by students at my university. Each essay is examined by two raters, who answer a series of questions. Most items in the questionnaire are simple yes/no questions, with the option for the rater to add observations; these observations will feed into the qualitative analysis of the study but will not count towards the interrater agreement analysis. My intention is to use the standard Cohen's kappa method to calculate the degree of agreement between raters, as it is generally considered an apt measure of actual agreement relative to chance agreement when there are just two raters.
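
To make the calculation concrete, here is a minimal sketch of Cohen's kappa for one pair of raters on yes/no items, written in Python; the counts are purely illustrative and not taken from the study:

def cohen_kappa(yes_yes, yes_no, no_yes, no_no):
    # Counts from the 2x2 contingency table of the two raters' yes/no answers.
    n = yes_yes + yes_no + no_yes + no_no
    p_o = (yes_yes + no_no) / n                    # observed agreement
    p_yes_a = (yes_yes + yes_no) / n               # rater A's marginal "yes" rate
    p_yes_b = (yes_yes + no_yes) / n               # rater B's marginal "yes" rate
    p_e = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)   # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

print(cohen_kappa(40, 5, 10, 45))  # illustrative counts: kappa = 0.70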

One problem that has arisen is how to deal with questions that may only be answered by one rater. For example, question 11, "How many examples did you find of criterion X in the essay?", is followed by this rubric: "If you answered 'none' in the previous question, please go to Section Two. If you answered 'one' or more, please answer questions 12-15." Questions 12-15 ask specific follow-up questions about criterion X, e.g., "Does criterion X support the writer's argument?" As a result, questions 12-15 are answered only by raters who consider criterion X to be present in the essay, and not by raters who consider it to be absent. In some cases, this can lead to one rater answering substantially more questions than the rater he or she is paired with.

Common sense suggests to me that I should measure interrater agreement only for the questions answered by both raters: the rater who does not answer questions 12-15 cannot be said to agree or disagree with the other researcher's answers on this criterion, since for her/him the criterion is not present in the essay. Furthermore, if questions 12-15, and other questions answered by only one rater for the same reason, are counted as disagreement, the interrater agreement score drops drastically, despite the relatively high level of agreement on the questions answered by both raters. For example, on the basis of the trial analysis that we are currently conducting, one pair of raters would have a kappa score of 0.75, corresponding to the upper range of "substantial agreement", if we count only the questions answered by both raters, but a score of only 0.44, in the lower range of "moderate agreement", if we count all the questions. Agreement levels should increase in the final analysis owing to training and the disambiguation of some items in the questionnaire. Nevertheless, the fear remains that counting questions answered by only one rater as disagreement will unduly lower the agreement score.
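
To show the difference between the two options on toy data (not our data), a comparison along the following lines can be run in Python, using scikit-learn's cohen_kappa_score as a convenient implementation; None marks an item that a rater skipped because of the routing rule after question 11:

from sklearn.metrics import cohen_kappa_score

# Invented yes/no responses for one pair of raters on one essay.
rater_a = ["yes", "no", "yes", "yes", None, None, None, None, "no", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "no", "no", "no", "yes"]

# Option 1: keep only the items answered by both raters (pairwise-complete analysis).
pairs = [(a, b) for a, b in zip(rater_a, rater_b) if a is not None and b is not None]
a_both, b_both = zip(*pairs)
print(cohen_kappa_score(a_both, b_both))   # agreement on jointly answered items only

# Option 2: recode a skipped item as its own category ("not present"), so an item
# answered by only one rater automatically counts as a disagreement.
a_all = [a if a is not None else "not present" for a in rater_a]
b_all = [b if b is not None else "not present" for b in rater_b]
print(cohen_kappa_score(a_all, b_all))     # noticeably lower kappa on the same data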

However, common sense might not correspond to what is considered good statistical practice. Does anyone know whether questions answered by just one rater have to be included in the interrater analysis? Is there a way round this?
