I have 3 raters coding a single behavior (is the subject looking at their partner's face?) into 3 categories: yes, no, or undetermined. Each rater gives a code every second for 150 seconds, for each of my 20 subjects.
I have tried calculating Fleiss' kappa and Gwet's AC1, but I have two problems. 1) Each statistic gives me a separate reliability result for every participant, and I'm not sure how to get an overall measure of reliability. Do I average the results across participants? 2) I'm not sure whether these statistics assume independent ratings, in which case they would be inappropriate for time series data.
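
To make the first point concrete, here is a minimal sketch of the two options for Fleiss' kappa: averaging the per-participant values versus computing one kappa on all seconds pooled across participants. The codes are simulated placeholders rather than real data, the helper names (`to_counts`, `fleiss_kappa`) and the 0/1/2 coding of yes/no/undetermined are assumptions made for illustration, and neither option does anything about the dependence between consecutive seconds.

```python
import numpy as np

def to_counts(codes, n_categories=3):
    """Turn an (n_seconds, n_raters) array of category codes (0..k-1)
    into an (n_seconds, k) table of how many raters chose each category."""
    codes = np.asarray(codes)
    return np.stack([(codes == c).sum(axis=1) for c in range(n_categories)], axis=1)

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n_items, n_categories) count table.
    Assumes every item was rated by the same number of raters.
    Undefined (division by zero) if all ratings fall in one category."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts[0].sum()
    # Observed agreement per item, then averaged over items
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()
    # Chance agreement from the overall category proportions
    p_cat = counts.sum(axis=0) / counts.sum()
    p_exp = (p_cat ** 2).sum()
    return (p_bar - p_exp) / (1 - p_exp)

# Simulated placeholder data: 20 subjects, 150 seconds, 3 raters, 3 categories.
# Codes are hypothetical: 0 = yes, 1 = no, 2 = undetermined.
rng = np.random.default_rng(0)
n_subjects, n_seconds, n_raters, n_categories = 20, 150, 3, 3
per_subject_counts = []
for _ in range(n_subjects):
    truth = rng.integers(0, n_categories, size=n_seconds)
    noise = rng.integers(0, n_categories, size=(n_seconds, n_raters))
    keep = rng.random((n_seconds, n_raters)) < 0.9  # rater follows "truth" 90% of the time
    codes = np.where(keep, truth[:, None], noise)
    per_subject_counts.append(to_counts(codes, n_categories))

# Option A: one kappa per subject, then the simple average
per_subject_kappa = np.array([fleiss_kappa(c) for c in per_subject_counts])
mean_kappa = per_subject_kappa.mean()

# Option B: stack all subject-seconds into one table and compute a single kappa
pooled_kappa = fleiss_kappa(np.vstack(per_subject_counts))

print(f"mean of per-subject kappas: {mean_kappa:.3f}")
print(f"pooled kappa:               {pooled_kappa:.3f}")
```

The two numbers can differ because the pooled version estimates chance agreement from the category proportions over all subjects, while the per-subject version uses each subject's own proportions.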