
The specific context is industry and occupation codes. These are hierarchical 6-digit codes that classify a person's industry and occupation based on their response to an open-ended survey question, and there are hundreds of codes in total (i.e., hundreds of categories). We are comparing a set of human-coded data against an electronic/automatic coding system. I've Googled and found some good articles on reliability, but nothing (at least nothing recent) addressing a method for checking the reliability of these codes, or of variables with many categories in general. I'm going to get Fleiss's book from the library since I recall it being helpful in the past.

Here's the problem as I see it, and some suggested approaches. All comments welcome. Thanks!

1) Since we have hundreds of categories, and particularly because the codes have a hierarchical structure, a simple agreement rate or kappa will be low simply because of the number of categories and the way they are applied. Minor mis-codings or unreliability in the latter digits will throw off the overall agreement/reliability. For example, the code for "lawyer" might be reliable, but the code for the type of lawyer may not be. An overall analysis would show lower reliability even if the first digits of the codes are reliable.
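
One way to keep partial matches from counting as total disagreements might be a weighted kappa whose disagreement weight depends on how many leading digits two codes share. Here's a minimal R sketch, assuming both sets of codes are 6-character strings; `prefix_match` and `hierarchical_kappa` are hypothetical helper names I made up, not functions from any package:

```r
prefix_match <- function(a, b, n_digits = 6) {
  # number of leading digits on which two codes agree
  a_chr <- strsplit(a, "")[[1]]
  b_chr <- strsplit(b, "")[[1]]
  same  <- a_chr == b_chr
  if (all(same)) n_digits else which(!same)[1] - 1
}

hierarchical_kappa <- function(human, auto, n_digits = 6) {
  cats <- sort(unique(c(human, auto)))
  # disagreement weights: 0 for identical codes, rising toward 1 as the shared prefix shrinks
  w   <- outer(cats, cats,
               Vectorize(function(a, b) 1 - prefix_match(a, b, n_digits) / n_digits))
  obs <- table(factor(human, levels = cats), factor(auto, levels = cats)) / length(human)
  exp <- outer(rowSums(obs), colSums(obs))
  1 - sum(w * obs) / sum(w * exp)  # Cohen's weighted kappa with disagreement weights
}
```

With weights like these, a pair of codes that agrees on the first three digits gets most of the credit, so a slip at the finest level no longer counts the same as coding someone into an entirely different occupation group.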

2) In addition, some of the categories are likely used frequently and some infrequently, if at all. Something tells me this will attenuate overall reliability, but I can't express it any better than that at this point.
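
This sounds like the prevalence problem: kappa corrects for chance agreement using the marginal distribution of the categories, so when a few codes dominate, chance agreement is high and kappa drops even though raw agreement is unchanged. A toy simulation (entirely made-up numbers) illustrates the effect:

```r
library(irr)  # agree() and kappa2()

set.seed(1)
n <- 5000
relabel <- function(x, k = 10, p_agree = 0.9) {
  # second coder copies the first 90% of the time, otherwise picks a different category
  ifelse(runif(length(x)) < p_agree, x,
         ((x - 1 + sample(1:(k - 1), length(x), replace = TRUE)) %% k) + 1)
}

bal <- sample(1:10, n, replace = TRUE)                                # balanced category use
skw <- sample(1:10, n, replace = TRUE, prob = c(0.9, rep(0.1/9, 9)))  # one dominant category

agree(cbind(bal, relabel(bal)))$value; kappa2(cbind(bal, relabel(bal)))$value
agree(cbind(skw, relabel(skw)))$value; kappa2(cbind(skw, relabel(skw)))$value
# raw agreement is ~90% in both cases, but kappa is noticeably lower for the skewed codes
```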

3) My first thought is to check the reliability of each digit of the codes (or of combinations of digits, depending on how coders apply them, e.g., two- or three-digit chunks). I still have to learn their exact structure and don't have the data right now. I believe they are NIOCCS codes (http://www.cdc.gov/niosh/topics/coding/overview.html).
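
Once the data arrive, something along these lines might work for the digit-by-digit check. It's only a sketch: I'm assuming the two sets of codes live as 6-character strings in columns `human_code` and `auto_code` of a data frame `dat` (hypothetical names), and it simply truncates both codes to each prefix depth and recomputes agreement and kappa:

```r
library(irr)

prefix_reliability <- function(dat, max_digits = 6) {
  do.call(rbind, lapply(1:max_digits, function(d) {
    r <- cbind(substr(dat$human_code, 1, d), substr(dat$auto_code, 1, d))
    data.frame(digits    = d,
               agreement = agree(r)$value,   # raw % agreement at this prefix depth
               kappa     = kappa2(r)$value)  # unweighted Cohen's kappa at this depth
  }))
}

prefix_reliability(dat)
```

I'd expect agreement to fall as depth increases; where it drops sharply is where the two coding processes start to diverge (reliable at the "lawyer" level but not at the type-of-lawyer level, say).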

4) My second thought was to split the data by "job type" (i.e., the first level of coding) and look at reliability within job type. Similarly, I could look at reliability for "professional" vs. "trade" jobs if I can find a key for mapping the NIOCCS codes into broad classes of jobs.
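
For that, the same assumed columns could be split on the first digit (or on a professional/trade flag once a crosswalk is in hand); column names are again hypothetical:

```r
library(irr)

dat$major_group <- substr(dat$human_code, 1, 1)  # first level of the hierarchy

by(dat, dat$major_group, function(g) {
  r <- cbind(g$human_code, g$auto_code)
  c(n = nrow(g), agreement = agree(r)$value, kappa = kappa2(r)$value)
})
```

One caveat: rarely used groups will give unstable kappas, which circles back to the prevalence issue in point 2.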

5) Finally, my colleagues are talking about doing a "concordance" analysis (which seems to be a public health term). From what I can tell, this is just an agreement rate. I'm familiar with kappa and weighted kappa, but not with techniques for data with this many categories. I found the irr R package (http://cran.r-project.org/web/packages/irr/irr.pdf) and read the description of each technique it offers, but didn't see anything specifically for large numbers of categories.
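
If "concordance" does just mean raw agreement, irr will report it alongside kappa in a couple of lines (same assumed column names as above):

```r
library(irr)

ratings <- cbind(dat$human_code, dat$auto_code)
agree(ratings)$value   # percent exact agreement ("concordance")
kappa2(ratings)$value  # chance-corrected agreement over all codes at once
```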

Thanks for your thoughts and leads. 
