How can an inter-rater reliability (IRR) test be performed?

02 February 2015

I have conducted a study in which experts were asked to identify problems in two different systems. They classify each problem found into one of fifteen types and rate it as one of four categories. The number of problems identified varies from expert to expert. How can I apply inter-rater reliability (IRR) to show whether the test is used consistently by the experts?

Subhash Chandra

This may help you:

http://en.wikipedia.org/wiki/Inter-rater_reliability

Khubaib Amjad Alam

This may be relevant:

Inter-rater reliability in child sexual abuse diagnosis among expert reviewers

Ariel Linden

Stata has quite a flexible command for IRR, -kappa-, which allows you to test (1) more than two raters, two ratings; (2) more than two raters, more than two ratings, fixed number of raters; (3) more than two raters, more than two ratings, varying number of raters. You can also use weights for adjustment. There is also a user-written command that provides confidence intervals (ssc install kapci).
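For readers without Stata, case (2) above (more than two raters, more than two ratings, fixed number of raters) can be sketched in plain Python with Fleiss' kappa. This is only an illustration and not a reimplementation of Stata's -kappa-; the function name `fleiss_kappa` and the example ratings are made up, and it assumes every item was rated by the same number of raters.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of lists; each inner list holds the category labels
             that the raters assigned to one item.
    categories: all possible category labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(r) for r in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters)
             for cat in categories]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        return float("nan")  # kappa is undefined if only one category is ever used
    return (p_bar - p_e) / (1.0 - p_e)

# Hypothetical data: 5 problems, each rated by 3 experts on a 1-4 scale.
example = [
    [1, 1, 2],
    [3, 3, 3],
    [4, 4, 3],
    [2, 2, 2],
    [1, 2, 2],
]
print(round(fleiss_kappa(example, categories=[1, 2, 3, 4]), 3))
```

Note that this treats the four categories as unordered; if they are ordered, a weighted statistic may be more appropriate.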

Eric L Hargreaves

IRR is all about relatedness, and therefore all about correlation in one form or another. Your particular difficulty is that you have multiple raters, not all of whom rated the same problems in the two systems.

The strongest analysis you can perform would be on the subset where all the commonalities intersect, i.e., the problems identified by all of the raters, each of whom then gives a categorical ranking.
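As a rough sketch of how that intersecting subset might be extracted, assuming the problems reported by different experts can be matched to shared identifiers (the IDs, expert names, and ratings below are hypothetical):

```python
# Hypothetical data: expert -> {problem id -> category (1-4)}.
ratings_by_expert = {
    "expert_A": {"P1": 2, "P2": 3, "P3": 1, "P5": 4},
    "expert_B": {"P2": 3, "P3": 2, "P4": 1, "P5": 4},
    "expert_C": {"P1": 1, "P2": 3, "P3": 1, "P5": 3},
}

def common_problems(ratings_by_expert):
    """Return the sorted problem ids that every expert identified."""
    id_sets = [set(r) for r in ratings_by_expert.values()]
    return sorted(set.intersection(*id_sets))

def complete_table(ratings_by_expert):
    """Map each common problem to its ratings, one per expert (sorted by name)."""
    experts = sorted(ratings_by_expert)
    return {
        problem: [ratings_by_expert[e][problem] for e in experts]
        for problem in common_problems(ratings_by_expert)
    }

table = complete_table(ratings_by_expert)
for problem, row in table.items():
    print(problem, row)
```

The resulting table has one row per problem and one rating per expert, which is the complete layout that a multi-rater agreement statistic expects.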

I suspect your dataset is more patchwork than this however.

I am guessing that two systems were provided, and a number of experts were asked to detect problems in both. The number of detected problems was decided subjectively; each problem was then subjectively identified as one of 15 types and ranked as belonging to one of 4 categories. As such, inter-rater reliability can relate the number of problems detected in each of the two systems by the different raters. It can also relate the distribution of the different types of problems identified in the two systems. Finally, for each type of problem identified, it can relate the 1-of-4 categorization of that problem type.

Since there are no direct rankings on a common scale (i.e., items 1-10 each ranked on a scale of 1-5, where 1 means horrible and 5 means best), your inter-rater reliability will more than likely rest on creative ways of counting instances that are common, or on a creative way of relating distributions that you order intentionally across the raters but that have no innate ranking from best to worst.
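One possible way of "counting instances that are common", offered only as an illustration, is the average pairwise overlap (Jaccard index) between the sets of problems each expert detected. The expert names and problem IDs are hypothetical, and the sketch assumes detected problems can be matched across experts.

```python
from itertools import combinations

# Hypothetical data: expert -> set of problem ids detected in one system.
detected = {
    "expert_A": {"P1", "P2", "P3"},
    "expert_B": {"P2", "P3", "P4"},
    "expert_C": {"P2", "P5"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_overlap(detected):
    """Average Jaccard index over all pairs of experts."""
    pairs = list(combinations(sorted(detected), 2))
    return sum(jaccard(detected[x], detected[y]) for x, y in pairs) / len(pairs)

print(round(mean_pairwise_overlap(detected), 3))
```

A value near 1 means the experts largely found the same problems; a value near 0 means they found mostly different ones. Unlike kappa, this measure is not corrected for chance agreement.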

Good luck.
