In qualitative thematic analysis, two raters are invited to code the qualitative data into various themes. How can I calculate the inter-rater reliability, and should the calculation be built into the coding process?
Inter-rater reliability (IRR) is easy to calculate for qualitative research, but you must outline your underlying assumptions for doing it. You should give a little more detail about the type of qualitative methodology you are following and why you are using IRR. In some disciplines and schools of thought, IRR is considered unnecessary, as the researchers themselves bring varied but valid perspectives to identifying unique codes and themes in the data.
Here are some questions you should ask yourself:
1) Am I looking for generalizability of the findings beyond the sample to an entire population? (often not recommended, but if so desired it requires large sample sizes)
2) Am I looking to generalize beyond the raters to make a statement about how they interact with this specific subject? (for identifying latent traits contributing to some phenomenon or another)
3) Am I doing this because a journal/supervisor is asking for it, regardless of whether it is helpful?
Once you have figured out the answers to these questions, you can start choosing which of the approaches currently in use best answers your research questions.
Here is a link that outlines the general idea behind IRR and provides information, along with additional resources, on calculating the most notable measures: percent agreement, Holsti's method, Scott's pi (π), Cohen's kappa (κ), and Krippendorff's alpha (α).
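If you want to compute two of these yourself for two raters who each assign one code per data segment, here is a minimal Python sketch of percent agreement and Cohen's kappa. The code labels and data are invented for illustration, and packaged implementations (e.g. scikit-learn's cohen_kappa_score) are also available if you prefer not to do it by hand:

```python
from collections import Counter

def percent_agreement(rater1, rater2):
    """Proportion of segments on which the two raters assigned the same code."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    p_o = percent_agreement(rater1, rater2)   # observed agreement
    freq1 = Counter(rater1)                   # marginal code frequencies, rater 1
    freq2 = Counter(rater2)                   # marginal code frequencies, rater 2
    codes = set(freq1) | set(freq2)
    # chance agreement: probability both raters independently pick the same code
    p_e = sum((freq1[c] / n) * (freq2[c] / n) for c in codes)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes assigned by two raters to ten transcript segments
r1 = ["trust", "conflict", "trust", "support", "conflict",
      "trust", "support", "trust", "conflict", "support"]
r2 = ["trust", "conflict", "support", "support", "conflict",
      "trust", "trust", "trust", "conflict", "support"]

print(f"Percent agreement: {percent_agreement(r1, r2):.2f}")  # 0.80
print(f"Cohen's kappa:     {cohens_kappa(r1, r2):.2f}")       # about 0.70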
Thank you, Mr. Chan, for asking this question. I too have interviews coded in NVivo and realized that calculating inter-rater reliability among three coders is not a straightforward operation. One of the coders was at a qualitative methods training and clarified with the trainers that it could not simply be run in NVivo for the three coders.
Robert Rivers' questions are helpful, and I am going to the link to read some more... best wishes on your research!
I agree that three coders is not a straightforward operation, but I am still puzzled by the two-coder case.
Robert's link is really helpful. I will try to respond to Robert's questions here:
1) I use inter-rater coding for my analysis because I want to increase trustworthiness, not so much to generate it. I tend to agree with the argument that IRR is easy to calculate but that one should be clear about the underlying assumptions. Why do you do it? If it is for generalizability, then it may be seen as a weak attempt to make a qualitative study achieve that. Qualitative research is not known for having the power or purpose for that, and given the small sample size it is even more difficult;
2) the context of my study is qualitative research using focus groups with 30 couples, in four men's groups and four women's groups respectively. Two raters were invited to code the qualitative data, about gender intimacy, into various themes;
3) I am doing this for academic publication purposes; I have faced critiques from reviewers about the lack of IRR in the manuscript.
I am studying your link but would welcome further advice.
I have done thematic analysis of focus group data, but the reviewer of my article asked why I had not used kappa for inter-rater reliability, even though the themes were also verified by subject specialists. Kindly help me with a reference I can use to defend against this objection.
I know it's a little late to answer your question, but I'm sure many researchers are looking for an answer to the same question. Here is my recent paper, which I hope can directly answer it. It also suggests an approach for calculating intercoder reliability:
Nili, A., Tate, M., & Barros, A. (2017). A Critical Analysis of Inter-Coder Reliability Methods in Information Systems Research. Australasian Conference on Information Systems, Hobart, Australia.
The title of the paper includes the words "information systems", but the paper is relevant to most business, management, and other related fields in the social sciences. We are using it for various projects here in Australia.
Cohen's kappa can be used for inter-rater reliability in thematic and content analysis with two raters. Here is an online tool to calculate it: http://vassarstats.net/kappa.html
For more than two raters, Fleiss' kappa can be used.
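For what it is worth, here is a minimal Python sketch of Fleiss' kappa for three or more raters, assuming each rater assigns exactly one code per segment. The code labels and data are invented for illustration; statsmodels also offers a packaged fleiss_kappa in statsmodels.stats.inter_rater if you would rather not compute it by hand:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of segments, each a list of the codes
    assigned by every rater (same number of raters per segment)."""
    N = len(ratings)          # number of segments
    n = len(ratings[0])       # raters per segment
    categories = {c for seg in ratings for c in seg}

    # per-segment agreement P_i = (sum_j n_ij^2 - n) / (n * (n - 1)), averaged over segments
    P_bar = sum(
        (sum(count ** 2 for count in Counter(seg).values()) - n) / (n * (n - 1))
        for seg in ratings
    ) / N

    # chance agreement P_e = sum_j p_j^2, with p_j the overall share of category j
    total = N * n
    P_e = sum((sum(seg.count(c) for seg in ratings) / total) ** 2 for c in categories)

    return (P_bar - P_e) / (1 - P_e)

# Hypothetical: three raters coding six transcript segments
ratings = [
    ["trust", "trust", "trust"],
    ["conflict", "conflict", "support"],
    ["support", "support", "support"],
    ["trust", "conflict", "trust"],
    ["conflict", "conflict", "conflict"],
    ["support", "trust", "support"],
]
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.2f}")  # 0.50 for this toy data
```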
In any qualitative method, one needs to consider the rationale for using a statistical analysis to determine reliability. Does it fit with the underlying epistemological approach? For thematic analysis it would, because TA is atheoretical and probably closest to a critical realist epistemology. It is more problematic with IPA, grounded theory, discourse analysis, situational analysis, and others.