From the examples of Cohen's kappa I've found so far, the method seems to apply only to mutually exclusive categories when measuring inter-rater reliability. What if raters can assign each item to two or more categories at once? What would be a suitable method for measuring inter-rater reliability in that case? Simple percentage agreement?
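To make the setup concrete, here is a minimal sketch (assuming Python with scikit-learn; the items, categories, and ratings are made up) of the kind of data I mean, treating each category as its own yes/no decision per rater and comparing a per-category Cohen's kappa with simple percentage agreement:

```python
# Hypothetical example: two raters may tag each item with any subset of
# three made-up categories, encoded as binary indicator columns.
import numpy as np
from sklearn.metrics import cohen_kappa_score

categories = ["A", "B", "C"]  # hypothetical category labels

# Rows = items, columns = categories; 1 means the rater assigned that category.
rater1 = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
rater2 = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1],
                   [0, 0, 1],
                   [1, 1, 0]])

for j, cat in enumerate(categories):
    kappa = cohen_kappa_score(rater1[:, j], rater2[:, j])  # kappa for this category alone
    percent = np.mean(rater1[:, j] == rater2[:, j])        # simple percentage agreement
    print(f"{cat}: kappa={kappa:.2f}, percent agreement={percent:.2f}")
```

This per-category approach is just one option I have been considering; I am not sure whether it is the accepted way to handle non-exclusive categories, which is why I am asking.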
