I have a dataset composed of risk scores from four different healthcare providers. Each risk score indicates a risk category of low, medium, high or extreme. I've been able to calculate agreement between the four risk scorers (in the category assigned) using Fleiss' kappa, but unsurprisingly it has come out very low; in fact I got a negative kappa value. Looking back at the data, there are many cases where, for example, three of the scorers have said 'extreme' and one has said 'high'. Under ordinary (unweighted) kappa this counts as disagreement, but of course the categories are adjacent, so while it is not full agreement it is a great deal better than, say, two scorers saying 'extreme' and two saying 'low', where the ratings do not even fall into adjacent categories.
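
For concreteness, here is a minimal sketch of the kind of unweighted calculation I mean, using statsmodels on made-up example scores rather than my actual data (categories coded 0 = low through 3 = extreme):

```python
# Sketch only: unweighted Fleiss' kappa on invented scores, not my real dataset.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = cases, columns = the four risk scorers; 0 = low, 1 = medium, 2 = high, 3 = extreme
ratings = np.array([
    [3, 3, 3, 2],   # three say 'extreme', one says 'high' -> counted as plain disagreement
    [0, 0, 1, 0],
    [3, 3, 0, 0],   # a genuinely split case, yet treated no worse than the adjacent one above
    [2, 2, 2, 2],
])

table, _ = aggregate_raters(ratings)        # case-by-category count table
print(fleiss_kappa(table, method='fleiss'))
```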

I understand the basic principles of weighted kappa and I think this is the approach I need to take, but I'm struggling a little with how weighted kappa works when there are multiple raters. Does anyone have experience with this and advice on how best to tackle it? A sketch of one idea I'm considering follows below.
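
To make the question concrete, here is one possibility I've sketched (not a worked solution): averaging the quadratic-weighted Cohen's kappa over every pair of the four raters, using scikit-learn. I'm not sure whether this pairwise-average approach (a weighted version of Light's kappa) is the right multi-rater generalisation, or whether something like Krippendorff's alpha with an ordinal metric would be more appropriate.

```python
# Sketch: average quadratic-weighted Cohen's kappa over all pairs of raters.
# Whether this pairwise average is the right multi-rater generalisation is
# exactly what I'm asking about.
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# same made-up data as above: rows = cases, columns = the four scorers, 0 = low ... 3 = extreme
ratings = np.array([
    [3, 3, 3, 2],
    [0, 0, 1, 0],
    [3, 3, 0, 0],
    [2, 2, 2, 2],
])

labels = [0, 1, 2, 3]   # fix the label set so every pairwise table covers all four categories
pairwise = [
    cohen_kappa_score(ratings[:, i], ratings[:, j], labels=labels, weights='quadratic')
    for i, j in combinations(range(ratings.shape[1]), 2)
]
print(np.mean(pairwise))
```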
