We want to measure the level of agreement between several raters who classified drug-related problems for 20 different scenarios. We used an adapted PCNE classification in which they coded Problems, Causes and Interventions. Since the data are nominal, we aimed to use Fleiss' kappa. However, the categories are not mutually exclusive (e.g. one scenario can contain several problems), and the raters often assigned several different codes within the Problems, Causes and Interventions sections. It seems that Fleiss' kappa requires mutually exclusive categories, i.e. exactly one code per rater per scenario. Any ideas on how to measure the level of agreement in such cases?
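
For context, this is roughly what we would have done if each rater had assigned exactly one code per scenario (a minimal sketch with statsmodels; the scenario count and the codes below are made up for illustration). It breaks down in our case because a single cell cannot hold several codes from one rater:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: 5 scenarios (rows) x 4 raters (columns); each cell is ONE
# hypothetical problem code (0-2) per rater per scenario.
ratings = np.array([
    [0, 0, 1, 0],
    [2, 2, 2, 1],
    [1, 1, 1, 1],
    [0, 2, 0, 0],
    [1, 0, 1, 2],
])

# aggregate_raters converts the raw ratings into a scenarios x categories
# count table, which is the input format fleiss_kappa expects.
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table, method="fleiss"))
```

Our actual data look more like a list of codes per rater per scenario, which is why this approach does not apply directly.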