I am hoping to compute intraclass correlations (ICCs) to evaluate interrater reliability (IRR) for interviews conducted with participants in a clinical psychology study. Specifically, we want to determine IRR for summed symptom scores from the interviews (e.g., agreement on a total depression symptom score).
We will have a relatively large sample of 200 participants, so the specific research team member who conducts the interview and makes the initial ratings varies across participants. A second rater will then listen to a recording of each interview and make their own ratings, but this second rater will also vary across participants, rather than the same person listening to all (or a fixed subset of) recordings.
So, to summarize, we will have two columns of data: one with ratings from the original interviewer (though which interviewer varies across participants) and one with ratings from the second researcher (where, again, the specific researcher varies across participants). In other words, the specific rater varies within each of the two columns.
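To make the layout concrete, here is a minimal sketch of the data structure I have in mind and the reshape to the long format that most ICC routines expect (Python/pandas; the column names are placeholders and the values are simulated stand-ins, not our real data):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the real data: one row per participant, with
# total symptom scores from the original interviewer and the second rater.
rng = np.random.default_rng(0)
n = 200
true_score = rng.normal(20, 5, n)
wide = pd.DataFrame({
    "participant": np.arange(1, n + 1),
    "rating_interviewer": true_score + rng.normal(0, 2, n),
    "rating_second": true_score + rng.normal(0, 2, n),
})

# Most ICC implementations expect long format: one row per
# (participant, rating-source) pair, even though the person behind
# each rating source varies across participants.
long = wide.melt(id_vars="participant", var_name="rater", value_name="score")
print(long.head())
```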
I believe that we want to focus on absolute agreement (rather than consistency) and to report average-measures rather than single-measures agreement. I am unclear, though, on whether a one-way random or two-way random model is more appropriate (a two-way mixed model does not seem to fit based on what I have read). Based on reading Koo & Li (2016), I believe a two-way random model is more appropriate, given that our goal is to determine reliability across clinical raters with training in structured interviewing and clinical topics, but I would like input on this point in particular (whether a one-way random or a two-way random approach is more suitable).
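In case it is useful, here is how I would compute both candidate coefficients side by side for comparison (a sketch using pingouin's `intraclass_corr` on the simulated data from the sketch above; my understanding is that ICC1k is the one-way random, average-measures coefficient and ICC2k the two-way random, absolute-agreement, average-measures coefficient):

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Rebuild the simulated long-format data from the earlier sketch.
rng = np.random.default_rng(0)
n = 200
true_score = rng.normal(20, 5, n)
wide = pd.DataFrame({
    "participant": np.arange(1, n + 1),
    "rating_interviewer": true_score + rng.normal(0, 2, n),
    "rating_second": true_score + rng.normal(0, 2, n),
})
long = wide.melt(id_vars="participant", var_name="rater", value_name="score")

# pingouin returns six ICC variants; the two we are deciding between
# are the average-measures one-way and two-way random coefficients.
icc = pg.intraclass_corr(data=long, targets="participant",
                         raters="rater", ratings="score")
print(icc.loc[icc["Type"].isin(["ICC1k", "ICC2k"]),
              ["Type", "Description", "ICC", "CI95%"]])
# ICC1k: one-way random effects, average of k raters
# ICC2k: two-way random effects, absolute agreement, average of k raters
```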