I am doing research and need some advice; I would appreciate your help. We (two coders) read two articles: the first article is the original one, and the second article challenges the first. In the first article we identify some claims (arguments), and in the second article we identify other claims (counter-arguments); we then examine how the counter-arguments challenge the original arguments (for example, a counter-argument may say that under some conditions the original claim does not hold, and so on). I do not know how to calculate inter-rater reliability for the coders in the following scenario:
Coder 1 finds arguments A1, B1, C1 in the first article and counter-arguments A2, B2, C2 in the second article,
and she explains how the counter-arguments challenge the original arguments.
Coder 2 finds arguments D1, E1, F1 in the first article and counter-arguments D2, E2, F2 in the second article,
and she explains how the counter-arguments challenge the original arguments.
Coders 1 and 2 agree on the challenges, but they did not find the same claims in the articles.
Or we may have cases where the coders agree on the arguments but not on the challenges, cases where they agree on the challenges but not on the arguments, and so on!
I am confused about how we can calculate inter-rater reliability in these cases. For example, calculating inter-rater reliability based only on agreement on the challenges does not make sense, since the coders' original arguments and counter-arguments could be different.
If I calculate inter-rater reliability for the arguments, then what if the challenges are different? Or do you recommend that the coders keep only the arguments and counter-arguments that match, and then calculate inter-rater reliability for the challenges they assigned to each matched argument/counter-argument pair (see the sketch below for what I have in mind)? Or can we report three different inter-rater reliability values in the paper?
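To make the matched-pairs option concrete, here is a minimal Python sketch of what I have in mind. The challenge categories and the data are made up for illustration, and I am assuming each matched pair gets a single categorical "challenge type" code from each coder, scored with Cohen's kappa:

```python
# Hypothetical data: challenge-type codes assigned by each coder to the
# argument/counter-argument pairs that BOTH coders identified.
# The category labels ("scope", "evidence", ...) are invented for this sketch.
coder1 = ["scope", "evidence", "scope", "assumption"]
coder2 = ["scope", "evidence", "method", "assumption"]

def cohen_kappa(a, b):
    """Cohen's kappa for two raters coding the same items."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed proportion of agreement
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

print(cohen_kappa(coder1, coder2))  # kappa over the matched pairs only
```

My worry is that this kappa covers only the pairs both coders found, so it says nothing about their disagreement over which arguments exist in the articles in the first place.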
Do you know of any published paper that addresses the same problem?
I appreciate any help.