If you are coding the transcripts of the interviews and/or the self-reports, to put the responses into categories, you could have the coding done independently (after training) by more than one coder, and then compare their results to establish "inter-rater reliability." There is a literature on that, which you can get into via an internet search on the term I have put in quotation marks.
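As a concrete illustration of what such a comparison involves, here is a minimal sketch of Cohen's kappa, one common inter-rater reliability statistic, computed from two coders' category assignments. The coders, category labels, and data below are hypothetical stand-ins, not from any actual study:

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement between two coders,
    corrected for the agreement expected by chance."""
    n = len(coder_a)
    # Observed agreement: proportion of items coded identically.
    p_obs = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical codes assigned to ten responses by two trained coders.
coder_a = ["play", "talk", "play", "rule", "talk",
           "play", "rule", "play", "talk", "play"]
coder_b = ["play", "talk", "play", "talk", "talk",
           "play", "rule", "play", "rule", "play"]

print(round(cohen_kappa(coder_a, coder_b), 3))  # 0.677
```

Values near 1 indicate strong agreement beyond chance; by a common rule of thumb, values above roughly 0.6 to 0.8 are considered substantial, though acceptable thresholds vary by field and purpose.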
You could assess inter-rater reliability if your analysis is based on a content analysis in which it is important to count codes, but beyond that, it is usually recognized that the qualitative analysis process is inherently subjective.
In general, the concepts of reliability and validity are considered more appropriate for quantitative rather than qualitative analysis. The best source on qualitative alternatives to these criteria is still Lincoln and Guba's 1985 book, Naturalistic Inquiry.
Ensuring the reliability and validity of qualitative research tools, such as semi-structured interview questions and self-report measures, is crucial to maintain the credibility and trustworthiness of your study. Here are some guidelines to help you assess and enhance the reliability and validity of these tools:
Reliability:
Before the actual data collection, conduct a pilot test with a small sample to identify any ambiguities or issues with the interview questions or self-report measures.
Revise the tools based on feedback from the pilot test to enhance clarity and appropriateness.
Train interviewers or participants to ensure uniformity in the administration process.
Validity:
Ensure that the interview questions or self-report measures cover all relevant aspects of the research topic.
After conducting interviews, share findings with participants to ensure accuracy and resonance with their experiences.
Use participant feedback to validate the trustworthiness of your interpretations.
Compare interview data with observational or archival data to strengthen the validity of your results.
By carefully considering and addressing these factors, you can enhance the reliability and validity of your qualitative research tools, ensuring that they effectively capture the phenomena under investigation.
Reliability is more difficult to assess in qualitative research, because the goal of developing and describing categories, rather than counting behaviors in those categories, precludes rigorous measurement procedures. Although replication at a second data site might be considered to be like test-retest reliability, strict replication is generally considered impossible by ethnographers because of the lack of standardized controls and because the behavior of people is never static (LeCompte & Preissle, 1993, p. 332). In addition, each site studied is to some extent unique. Multiple sites produce different data sets, which may reveal much about the variability of behavior and the range of contexts but less about reliability.
The concern for convergence (reliability) within the quantitative paradigm can be contrasted with the desire for divergence (developing many categories and elaborating or refining them) within the qualitative paradigm. The empirical world is in constant flux; individuals within it are continually creating and recasting events experienced within ever-changing perceptual and conceptual frameworks (Blumer, 1969b, p. 23). Different persons--including different researchers--construct the world in distinct ways, thus limiting inter-observer reliability; likewise, any one researcher can change constructions within a short time, decreasing intra-observer reliability--indeed, this is a goal of qualitative research. Reliability describes the consistency of the measuring instrument, and the instrument in qualitative research is the researcher. Likewise, test-retest reliability can only be approximated because the test instrument--again, the researcher--is not exactly the same at two points in time (Lincoln & Guba, 1985, pp. 298-299).
A paradox of qualitative research is that when the researcher is maximally changing ways of looking at things because of the data observed, reliability as traditionally understood is minimized. It is minimal not because observation is less trustworthy, but because observed data make a difference in how perceptions are conceptualized. It might even be argued that consistencies between observations are more likely to be due to consistent researcher bias than to the data being consistently observed. More important than reliability, therefore, is the issue of validity.
However, some qualitative researchers suggest that a degree of reliability is desirable. Reliability is fostered by using low-inference descriptors; comparing multiple observers viewing the same events, which is inter-rater reliability; using research assistants; asking peers to examine findings; or mechanically recording data (LeCompte & Preissle, 1993, pp. 338-340). Reliability could also be represented in terms of the consistency in sorting behaviors into categories or consistency in designating behaviors by function.
In my own dissertation study, qualitative reliability is fostered both interpersonally and mechanically. My two sons developed and revised major categories of events and behaviors listed by children interviewed in my hallway study. In the process, the boys and I discussed these categories at length, particularly in relation to which activities belonged to which categories. After numerous discussions and changes in categories over several days' time, we eventually achieved complete agreement in categorizing the behaviors and events within categories.
In addition, qualitative reliability is addressed mechanically through videotaping observations and audiotaping interviews. These recordings have not yet been analyzed for reliability through complete transcription; partial transcription procedures were used, but several transcriptions of the same tapes by multiple transcribers are anticipated in the future. Some qualitative methodologists suggest that strict measures of qualitative reliability are impossible and thus rely on indications of validity, because the presence of validity implies reliability (Benson, 1994; Lincoln & Guba, 1985, p. 316). Lincoln and Guba (p. 317) also suggest that overlapping findings from different methods of research indicate reliability, a procedure used occasionally in this research study. However, this is perhaps better understood as an indication of convergent validity, to be considered shortly, rather than as a measure of reliability.
Internal Validity
Internal validity is a major strength in qualitative research (LeCompte & Preissle, 1993, p. 341), evidenced in several ways. Lincoln and Guba (1985, pp. 301-304) emphasize the importance of a lengthy stay in a naturalistic setting, allowing for extended engagement with persons that can reveal the researcher's distortions and selective perceptions. An extended time in the field also allows patterns of behavior to stabilize, which is essential for reliability as well as validity. A lengthy stay also provides informants the opportunity to develop trust in the researcher, imperative for the revelation of perspectives. Persistent observation in the research context also provides depth and focus, as more attention is given to multiple influences that surface and detailed analysis is given to those factors that are most relevant or salient, while avoiding premature closure. Persistence takes place with an attitude of skepticism, as premature closure on an issue may result in the easy acceptance of deception or of pretense meant to appear socially acceptable. The depth resulting from persistent observation balances the breadth that a lengthy stay in the field produces, Lincoln and Guba emphasize. In addition to a lengthy stay with persistent observation, validity is enhanced by reliance on informant interviews for data and by the researcher's constant self-monitoring (LeCompte & Preissle, 1993, pp. 342-348).
I studied the school in my dissertation over a four-month period, which can be considered a lengthy period of time. Throughout that time I continually monitored my perceptions and impressions through personal notes and theoretical notes, which are considered in detail in later chapters. This self-monitoring, as well as the extended period of observation, allowed me to correct misconceptions and observe the many varieties of the social formations considered. Some of these misconceptions and self-corrections are considered in Chapter Five. I also note in several places my surprise at certain findings, another indication of self-monitoring and thus internal validity. In addition, the several sessions of interviews allowed time for trust to develop with children, evidenced by their describing behavior in which they had participated that was forbidden and punished; they would be unlikely to do so with other adults in the school due to lack of trust. The depth and breadth of the study are indicated by the description of findings in Chapters Five through Eleven of my dissertation (available on ResearchGate).
Triangulation also helps establish the validity of qualitative research. Triangulation involves obtaining multiple perspectives on the same event; when those perspectives coincide or are similar, this suggests some degree of validity. Several forms of triangulation for establishing validity have been suggested (Patton, 1990, pp. 464-470; Lincoln & Guba, 1985, pp. 305-307). When both qualitative and quantitative procedures produce equivalent results, this is a form of source triangulation. Within the quantitative paradigm, this might be considered convergent validity without a coefficient, akin to comparing multiple methods in the multitrait-multimethod approach to validity (see Crocker & Algina, 1986, pp. 232-235). Although Lincoln and Guba suggest that discrepant findings between qualitative and quantitative methods indicate the likelihood of error in one of the methods, Patton believes that discrepancies may reflect different kinds of questions being answered, as well as the difficulties involved in determining convergence.
Source triangulation is evidenced by the use of both qualitative and quantitative procedures in my study. For example, my impression that phalanxes were more common than clusters, recorded in field notes during observations, was confirmed by quantitative measurement using videotapes. Other examples of convergence between qualitative and quantitative procedures are considered in later chapters on findings.
A second variety is method triangulation, comparing the results of different qualitative procedures. Again, Patton emphasizes that discrepant findings do not necessarily indicate invalid results; rather, they may reflect the need to discover why and when those differences occur--different methods may capture different aspects of behavior. In addition, multiple investigators or data analysts can be triangulated to determine consistencies, although this requires close communication so that both observers or analysts are studying the same thing.
Several interactive and non-interactive methods were used in my study, which converged occasionally. Many of the comments of children were consistent with what I had observed earlier, even though I purposefully framed questions to them that would not be leading. However, consistent with Patton's comment, there were also differences, particularly in that children described a wider variety of activities than what I observed. This underscored my preoccupation with social formations, which is consistent with my theoretical framework, but also indicated that I overlooked many specific hallway events, such as children sharing food with one another, while doing earlier observations. It is also likely that some of these activities children described were hidden from me during observations but willingly explicated after trust had developed during interviews.
During my research, I asked the undergraduate student who helped with videotaping to keep a record of trends she observed. I gave her no guidelines in this respect. She only jotted a few comments, but these underscored several trends I had observed earlier. Teacher comments during interviews at the conclusion of the study could also be considered multiple investigator triangulation, and convergence between these comments and my findings from observations and interviews of children are noted in later chapters. The videotapes and cassettes recorded potentially allow further triangulation by others in the future.
Lincoln and Guba do not favor the use of theoretical triangulation because they believe that multiple theories explaining the same phenomenon do not indicate evidence for the existence of the phenomenon. Patton, in contrast, emphasizes the value of theory triangulation because this process reveals how different premises and assumptions influence interpretation. I used a limited degree of theoretical triangulation by relating some findings to the two theories that framed the study, theories by Hall and Blumer.
Benson and Hagtvet (in preparation) use theory differently to establish validity. They suggest that multiple studies reflect different aspects of a given theory. These studies can produce results consistent with predictions by different aspects of the theory. The congruities between divergent theoretical predictions and empirical findings constitute a "nomological network" variety of construct validity. This nomological network can be conceptualized as triangulating multiple studies within different aspects of a theoretical framework. Benson and Hagtvet apply this form of validity analysis for quantitative data, but it might be adapted for qualitative validity as well, a possibility I wish to investigate in the future.
LeCompte and Preissle (1993, pp. 341-348) emphasize that qualitative validity is enhanced by the researcher who stays open by means of self-monitoring and the active search for negative cases; this is also emphasized by Lincoln and Guba (1985, pp. 309-313) and Patton (1990, pp. 463-464). Mehan (1979, p. 20) notes that the search for negative cases helps accomplish the goal of accounting for all incidents, as he does in his three-step analysis of teacher-student interaction. The search for negative cases was used in my data analysis, as detailed in Chapter Five of my dissertation, although I note that accounting for every case is an ideal not always reached.
Interview data are likely to counteract preconceived notions, assuming interviews are open-ended and not unduly influenced by the researcher's constructs, suggest LeCompte and Preissle. My interviews were to some extent open-ended, as I allowed and sometimes encouraged students to discuss related and even tangential issues, and I attempted to avoid leading probes. I encouraged openness rather than premature closure; for example, several times a member of a group would make a comment, to which others agreed, and I would inject a discrepant comment made by a child in another group to encourage discussion of different viewpoints. Yet it is possible that I encouraged some reactions through unconscious body language. Some of my questions were clearly related to what I had observed earlier and to a priori theoretical concerns, but I attempted to encourage children to contradict my views if they wished. To some extent, the degree of open-endedness and researcher construct contamination may be evaluated by examining the interview protocols in Appendix A and my interview approach described in Chapter Four of my dissertation.
Member checks, in which a sample of those studied and others who share the context of the study are asked to verify, dispute, or revise categories and other emergent findings, can also contribute to conclusions on validity, though this procedure is not without its difficulties (Lincoln & Guba, 1985, pp. 314-316). Blumer (1969b, p. 22) similarly speaks of those being studied "talking back" to the researcher, correcting unrealistic portrayals of their views. Member checks were used throughout interviews, as I sometimes paraphrased what one or more children had said previously, either within the group or from another group, and asked if my understanding was accurate. I also used summary member checks of both children and teachers at the conclusion of the study, and both convergent and divergent results are reported in later chapters.
Ultimately, concludes Patton (pp. 468-469), qualitative validity is established by evidence of the believability of findings, such as including sufficient raw data in the report (for example, quotations from participants) and remaining open-ended so that readers can reach conclusions on their own and develop at least some of their own generalizations. I provide numerous quotations of children in later chapters, raw data as evidence for my conclusions. I also encouraged children to make some generalizations during interviews, again evidenced by the protocols in Appendix A.
Another means of confirming the trustworthiness of qualitative research is through the use of an "audit trail" (Lincoln & Guba, 1985, pp. 319-320), which involves the researcher archiving research materials. These include raw data in the form of videotapes or handwritten field notes, the products of data reduction and analysis such as theoretical notes, indications of data synthesis and reconstruction such as reports and descriptions of category structure development, notes on the process of research such as methodological notes, documents that reflect dispositions and intentions such as personal notes and the research proposal, and information about instruments used such as forms and schedules. These can then be made available for an external audit. The "audit trail" is valuable for both qualitative and quantitative research, Lincoln and Guba claim. I have archived hundreds of pages of notes, as well as many hours of videotape and cassette tapes, which constitute an audit trail that can be examined. Some of these notes are included in later chapters, which constitute a partial audit trail.
In conclusion, it can be noted that internal validity is addressed in many different ways by qualitative researchers. Several of these are considered in my research. The establishment of internal validity involves the degree of confidence placed upon findings, not absolute determination.
External Validity
LeCompte and Preissle (1993, p. 349) describe external validity as the degree to which a research site is typical and the likelihood of generalizing results. The credibility of applying findings to alternate sites is affected by the selection of persons studied, the setting in which they are studied, the distinct historical background and situation at the time of the study, and the degree to which the constructs defined are shared across people, settings, and time.
Ethnographic researchers often perceive external validity differently than those scholars operating from the quantitative perspective. In qualitative research, establishing applicability to other sites is considered a joint venture of the researcher and the one making the application. It is important that researchers fulfill their obligations in establishing external validity by describing those studied, the context, and other aspects of the research process, but just as important is the task of the individual who wishes to apply findings to an alternate site. The qualitative researcher provides the data from which generalization is possible, but specific application to other contexts requires knowledge of the second context, which the researcher does not possess (Lincoln & Guba, 1985, p. 316).
I directly addressed external validity in my dissertation study to only a minor extent. This was done by comparing the lower elementary wing of Pellegrini Elementary with the main site studied, the upper elementary wing of the same school. Convergent comparisons with other schools cited from the literature can also be considered a way of addressing external validity. More important, I describe in detail numerous aspects of the school environment and surrounding community in the next chapter, providing the basis for future attempts at generalization to other schools. Someone who wishes to make generalizations can compare these characteristics with those of the site to which findings are to be generalized to determine the degree of similarity and thus generalizability. In the next chapter I will make the case that a greater number and diversity of constructs emerge from an ideal site like Pellegrini Elementary than would be found at a more typical site; thus the likelihood of generalizing certain constructs is greater at an ideal rather than a typical site. Finally, I hope that other researchers will study additional sites that are very different from Pellegrini Elementary to determine the commonalities across divergent contexts. Commonalities found across very different kinds of contexts suggest the likelihood of greater generalizability.
Summation
Validity and reliability are important in qualitative research, often discussed under the rubrics of credibility or trustworthiness. As noted, some of the qualitative validity issues raised here are more specifically addressed in later chapters on setting, participants, and methodology, although not always overtly linked with the topic of validity.
Although quantitative reliability is addressed in this study by using percentage of agreement and kappas, quantitative validity is not as well addressed in the dissertation as the qualitative equivalent. Because multiple sites have not been investigated and the sampling of quantitative data herein is quite limited and less than random, conclusions about quantitative data must remain suggestive and tentative. However, to the degree that those conclusions coincide with qualitative data, the possibility of validity can be inferred by considering triangulation to be a form of convergent validity between quantitative and qualitative approaches. Perhaps some of the hypotheses suggested will, with further study in other contexts, become theories that can be tested through the nomological network form of construct validity.
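The difference between percentage of agreement and kappa can be illustrated with a short sketch; the category labels ("phalanx", "cluster", "other") are hypothetical stand-ins for a coding scheme, not actual study data. When categories are skewed, raw percentage agreement is inflated by chance, which is why kappa is worth reporting alongside it:

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of items the two coders labeled identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Agreement corrected for the chance agreement implied by each
    coder's marginal label frequencies."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical codes for 20 hallway events: both coders label most
# events "other" and disagree only on the two rare formations.
coder1 = ["other"] * 18 + ["phalanx", "cluster"]
coder2 = ["other"] * 18 + ["cluster", "phalanx"]

print(percent_agreement(coder1, coder2))      # 0.9
print(round(cohen_kappa(coder1, coder2), 2))  # 0.46
```

The coders agree on 90% of events, yet kappa is only 0.46: with 18 of 20 events in a single category, heavy agreement is expected by chance alone, so the chance-corrected statistic tells a more sober story.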
For full reference details, see my dissertation which is posted on ResearchGate.