I conducted a study in which experts were asked to identify problems in two different systems. For each problem found, an expert classified it as one of fifteen types and assigned it to one of four rating categories. The number of problems identified varies from expert to expert. How can I use inter-rater reliability (IRR) to show whether the experts applied the test consistently?