Observers 1 and 2 each entered a total of 477 observations, spread over 10 cases, so an average of 47.7 measurements per case. The set of answer options differs per measurement (ranging from 2 to 30 options). I counted that in 440 of the 477 measurements the two observers recorded the same value, and in 37 they recorded different values. Can I compute an overall inter-rater reliability from these data, and if so, how? I don't think this fits the standard 2×2 Cohen's kappa cross table (yes/no vs. yes/no).
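
For clarity, here is a minimal sketch of how my data are structured and what I have tried so far. The values below are hypothetical, not my real data; I am only showing the paired-ratings layout and the raw agreement calculation, plus the pooled kappa call I am unsure about:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: each position is one measurement, and the two lists
# hold the code chosen by observer 1 and observer 2 for that measurement.
# In my real data the set of possible answer options differs per measurement.
obs1 = ["A", "B", "C", "A", "D"]
obs2 = ["A", "B", "C", "B", "D"]

# Raw percent agreement; on my full data this would be 440 / 477 ≈ 0.92.
agreement = sum(a == b for a, b in zip(obs1, obs2)) / len(obs1)
print(agreement)

# If all 477 pairs were pooled into one pair of lists, this would give a
# single overall kappa -- but I am not sure that is valid when the answer
# options are not the same for every measurement.
print(cohen_kappa_score(obs1, obs2))
```
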