What are the proper techniques for analyzing reliability of categorical variables with a large number of categories?

01 January 2015 9 8K Report

The specific context is industry and occupation codes. These are 6-digit codes that are hierarchical and categorize a person's industry and occupation based on what they report to an open-ended survey question, and there are hundreds of codes total (i.e., hundreds of categories). We are comparing a set of human-coded data against an electronic/automatic coding system. I've Googled and found some good articles on reliability, but nothing (at least nothing recent) addressing a method for checking the reliability of these codes or variables with many codes in general. I'm going to get Fliess's book from the library since I recall that being a helpful one in the past.

Here's the problem as I see it, and some suggested approaches. All comments welcome. Thanks!

1) Since we have hundreds of categories, and particularly because the codes have a hierarchical structure, a simple agreement rate or kappa will be low simply because of the number of categories and they way they are applied. Minor mis-codings or unreliability in the latter digits will throw of the overall agreement/reliability. For example, the code "lawyer" might be reliable, but the code for the type of lawyer may not be. An overall analysis would result in lower reliability, even if the first digits of the codes are reliable.

2) In addition, some of the categories are likely used frequently and some infrequently if at all. Something tells me this will be a problem that will attenuate overall reliability, but I can't express it any better than that at this point.

3) My first thought is to check the reliability on each digit of the codes (or combination of digits depending on how the codes are applied by coders, eg., two- or three-digit chunks). I have to learn what their structure is but don't have the data right now. I believe they are NIOCCS codes (http://www.cdc.gov/niosh/topics/coding/overview.html).

4) My second thought was to split up the data by "job type" (i.e,. the first level of coding) and look at reliability within job type. Similarly I could look at reliability for "professional" v. "trade" jobs if I can find a key for coding the NIOCCS codes into broad classes of jobs.

5) Finally, my colleagues are talking about doing a "concordance" analysis (must be a public health term). From what I can tell this is just an agreement rate. I'm familiar with kappa and weighted kappa, but not with techniques where there are so many categories. I found the irr R package (http://cran.r-project.org/web/packages/irr/irr.pdf) and read the description of each technique it has but didn't see any specifically for large numbers of categories.

Thanks for your thoughts and leads.

Michal Illovský

Hi, I would suggest dropping the last digits to reduce the number of categories to any reasonable number. Then, when the categorization is reduced to the least level that still can be meaningfully ordered, use the weighted kappa.

Ronán Michael Conroy

The preliminary tasks seem to me to be

1. Reduce the classification system to one in which the categories are few enough to constitute distinct groups but sufficiently numerous to ensure that each category is a definable set.

2. Decide on what constitutes a disagreement. Ignore trivial disagreements (though these may be a sign that your categories are too numerous, and do not represent distinct sets).

3. Decide if all disagreements are to be weighted equally. If they are not, then decide on an ordered scale for disagreements.

I would then inspect the disagreements. Which categories are most and least disagreement-prone, and why?

You can calculate an overall statistic, but this is not as informative as an analysis of the cases for which the system works and the ones where it doesn't. It seems to me that your problem is too interesting to be dismissed with a statistic (kappa) whose interpretation is vague and arbitrary.

Matt Jans

Thanks all! Ronan, I like your ideas 2 and 3 (others have mentioned 1). It seems like I could accomplish a version of 2 by setting weights appropriately (i.e. small discrepancies are weighted 0 or even negative). I'm just thinking out loud here.

Tahir Kemal Şahin

Hi Matt, I think your 4th opinion is meaningful and much better. So you can obtain reliabilites by "job types". This approach will be more helpful for your objective. You can use STATA package program for analyzing weigted Kappa for large number of categories.

Matt Jans

Tahir, thanks for that idea. In Stata is this just done in "tabulate" or a different command? I haven't done weighted kappa in Stata before.

Tahir Kemal Şahin

Matt, It can be done by the command of "Statistics>Epidemiology and related>Other". Here you are an example output file which was copied from Stata output and pasted into Winword. The version of the program which I was used was Stata10.

Matt Jans

Thanks, Tahir.

Matt Jans

I wanted to give a quick update and ask a question. We're presenting this work next week and I'll share the poster once it's done. We're seeing agreement rates and kappas in the mid-0.7 to mid-0.8 range for both Census I&O codes and NAICS (I) and SOC (O) codes. I'm used to interpretting reliabilities for psychological phenomena and text content (and don't have a whole lot of experience with that anyway), but these seem relative high to me. For those of you used to looking at I&O code reliabilities, what do you think?

A note on the design: These are agreement/kappa comparing human-coded cases with the NIOCCS auto-coding system (http://www.cdc.gov/niosh/topics/coding/overview.html), not reliabilities between two human coders.

Thanks.

Oluwafemi Samson Balogun

All the explanations given are good ,you work on them

Who are some leading experts on nonprofit organizations and tax filing?

Does asking "Did you have any problems?" versus "What problems did you have?" get you more and better responses?

What are some good publishers for social science research methodology text books?

What are acceptable/common agreement rates or kappas for industry and occupation (I&O) codes?

What are the major (and minor) surveys that interview teens?

Has anyone seen (or calculated themselves) performance benchmarks between Stata and SAS?

How do everyday people think about their "public program participation" (e.g., foodstamps, Medicaid)?

Within-household interviews: Better to ask for more upfront or later?

What are your favorite contemporary papers on multilevel binary and multinomial logistic regression with survey weighted data?

Can anyone recommend a CV category for "edited but not peer-reviewed" articles?

What are the long-term impacts of incarceration on youths' developing brain?

• What the possible Persistent Organic Pollutants and Heavy metals present in fluorspar, sediments, and water bodies around its mining area?

How to learn more about SPSS and its Application?

Can I base on reverse DNA sequences to perform alignment, convert to amino acids and GenBank submission?

Baseline drift in HPLC? What causes this?

Text-Communication from the M1 Hand Area using BCI—and then there is Elon Musk?

Handling Missing Data and Building a Predictive Model with Incomplete Information ?

Has anyone applied Python in the field of textile engineering for data analysis, automation, or smart textiles?

How can I use the cif data obtained from rietveld refinement extracted via gsas2, for microstructural analysis using ETEX software?

How are iso-frequency contours plotted?