I want to calculate and quote a measure of agreement between several raters who rate a number of subjects into one of three categories. The individual raters are not identified and are, in general, different for each subject. The number of ratings per subject varies from 2 to 6.
In the literature I have found Cohen's kappa, Fleiss' kappa, and a measure 'AC1' proposed by Gwet. So far, I think that Fleiss' measure is the most appropriate, although he derives it assuming that the number of ratings per subject is the same for all subjects.
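For concreteness, this is how I am currently laying out the data and computing Fleiss' kappa, with the standard formula naively extended to a per-subject number of ratings n_i (the counts below are made-up illustrative values, and I am not sure this extension is actually justified, which is part of my question):

```python
import numpy as np

# Each row is a subject; the 3 columns are the categories.
# counts[i, j] = number of raters who placed subject i in category j.
# Row sums (number of ratings per subject) vary between 2 and 6.
counts = np.array([
    [2, 0, 0],
    [3, 1, 0],
    [1, 1, 1],
    [0, 5, 1],
    [4, 0, 2],
])

def fleiss_kappa_unequal(counts):
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1)                 # ratings per subject
    # Observed agreement per subject: agreeing rater pairs / all rater pairs.
    P_i = (np.sum(counts**2, axis=1) - n_i) / (n_i * (n_i - 1))
    P_bar = P_i.mean()
    # Chance agreement (Fleiss): squared overall category proportions.
    p_j = counts.sum(axis=0) / n_i.sum()
    P_e = np.sum(p_j**2)
    return (P_bar - P_e) / (1 - P_e)

print(fleiss_kappa_unequal(counts))
```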
Gwet's AC1 is supposed to deal with the apparent 'paradox' of low kappa values despite a high percentage of observed agreement. Unfortunately, I do not understand the derivation of this measure.
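My current, possibly mistaken, reading of Gwet's definition is that AC1 keeps the same observed agreement as Fleiss' kappa but replaces the chance-agreement term with one based on the average category propensities. The sketch below (again with made-up counts, chosen so that almost all ratings fall in one category) is how I would reproduce the paradox under that reading; corrections are welcome:

```python
import numpy as np

def agreement(counts):
    """Return (percent agreement, Fleiss' kappa, Gwet's AC1) for a counts table."""
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1)
    K = counts.shape[1]
    # Observed pairwise agreement, averaged over subjects.
    P_bar = np.mean((np.sum(counts**2, axis=1) - n_i) / (n_i * (n_i - 1)))
    # Fleiss' chance agreement: squared overall category proportions.
    p_j = counts.sum(axis=0) / n_i.sum()
    pe_fleiss = np.sum(p_j**2)
    # Gwet's chance agreement: based on average per-subject category propensities.
    pi_j = (counts / n_i[:, None]).mean(axis=0)
    pe_gwet = np.sum(pi_j * (1 - pi_j)) / (K - 1)
    kappa = (P_bar - pe_fleiss) / (1 - pe_fleiss)
    ac1 = (P_bar - pe_gwet) / (1 - pe_gwet)
    return P_bar, kappa, ac1

# Skewed toy data: raters almost always pick category 1, so raw agreement
# is about 93%, yet Fleiss' kappa comes out near 0.13 while AC1 stays high.
skewed = np.array([
    [4, 0, 0],
    [5, 0, 0],
    [3, 0, 0],
    [6, 0, 0],
    [4, 1, 0],
    [2, 0, 0],
])
print(agreement(skewed))
```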
I would be grateful for any comments and suggestions, particularly on the appropriateness of Fleiss' kappa with unequal numbers of ratings per subject and on the soundness of Gwet's AC1.