I am doing an inter-rater reliability calculation for an instrument that measures conceptual understanding through open-ended questions. The instrument consists of 6 related open-ended questions, each rated with a rubric containing 3 criteria (interpretation, application, explanation), so every student receives an interpretation, application, and explanation score on a 4-point scale for each question.

I have looked at Cohen's kappa and other inter-rater reliability measures, but I cannot find a suitable tool for this specific design: most statistical tools only accommodate one question and one measure at a time. What would be the best strategy to objectively quantify inter-rater reliability across all 6 questions and 3 criteria?
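
For context, here is a minimal sketch of the one-question-one-measure approach I can manage with existing tools (Python with scikit-learn; the file name, column names, and the assumption of exactly two raters are placeholders for illustration). It produces 6 × 3 = 18 separate weighted kappas rather than one overall index, which is exactly the limitation I want to get around:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical long-format data: one row per (student, question, criterion),
# with each rater's 4-point score in its own column. All names below are
# placeholders, not part of any specific dataset.
df = pd.read_csv("ratings.csv")  # columns: student, question, criterion, rater1, rater2

# Compute one weighted kappa per (question, criterion) cell: 6 x 3 = 18 values.
# Quadratic weights are used because the 4-point scale is ordinal, so larger
# disagreements between raters are penalized more heavily.
results = []
for (q, c), grp in df.groupby(["question", "criterion"]):
    kappa = cohen_kappa_score(grp["rater1"], grp["rater2"], weights="quadratic")
    results.append({"question": q, "criterion": c, "kappa": kappa})

print(pd.DataFrame(results))
```

This gives me 18 kappas to report and no principled way to combine them, which is why I am asking whether there is a single coherent strategy (or statistic) for the whole instrument.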