I want to check internal consistency, face validity, content validity, convergent and discriminant validity of a questionnaire. What are the latest and best methods for determining these?
VALIDITY: Validity is defined as the degree of agreement between the claimed measurement and the real world. There are three categories of validity testing: (i) content validity, (ii) criterion validity, and (iii) construct validity. Content validity asks whether the current test covers all relevant items needed to answer the research question. Criterion validity is the degree of correlation between the current test and a predetermined standard; the predetermined standard scores are those that prior studies have tested and held to be valid. Construct validity is the degree to which the test actually measures what the theory claims.
(i) Face Validity. Use the respondents to answer the question: Does the survey or test measure what it is intended to measure? This is the subjective view of the respondents to the survey (not experts). Use this as a test run before distributing the real survey.
(ii) Content Validity. Use an expert panel to answer the question: Is the question or skill measured in the test "essential" to the intended measurement? Form a panel of subject matter experts (SMEs) and ask them whether your intended questions or survey items are relevant to your intended research issue. Then use the Lawshe test:
CVR = (ne - N/2) / (N/2)
... where CVR = content validity ratio; ne = number of experts in the panel who answered "yes, relevant"; and N = total number of experts in the panel.
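For concreteness, here is a minimal Python sketch of the formula above; the function and variable names (content_validity_ratio, n_essential, n_experts) are my own illustrative choices, not from Lawshe (1975):

```python
# Minimal sketch: Lawshe's content validity ratio (CVR) for a single item.

def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """CVR = (ne - N/2) / (N/2), where ne is the number of experts who
    rated the item relevant/essential and N is the total panel size."""
    half = n_experts / 2
    return (n_essential - half) / half

# Example: 8 of 10 experts rate an item relevant -> CVR = (8 - 5) / 5 = 0.6
print(content_validity_ratio(8, 10))  # 0.6
```

CVR ranges from -1 (no expert finds the item relevant) through 0 (exactly half do) to +1 (all do), which is why the garbled versions of the formula one sometimes sees cannot be right.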
(iii) Construct Validity. There are two kinds of construct validity: (a) convergent validity and (b) discriminant validity. Convergent validity exists when measures that are expected to be correlated indeed turn out to be correlated: testing H0: r = 0 against HA: r ≠ 0, the data show H0 to be incorrect, and it is rejected. In discriminant validity, by contrast, measures expected to be unrelated should yield r ≈ 0, and H0 cannot be rejected. Use the correlation coefficient as the unit of analysis.
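As a rough illustration of this correlation-based check, here is a sketch using scipy; the scale names and the simulated data are entirely hypothetical:

```python
# Minimal sketch: convergent/discriminant validity via Pearson's r.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
scale_a = rng.normal(size=100)
scale_b = scale_a + rng.normal(scale=0.5, size=100)  # theory: should converge with A
scale_c = rng.normal(size=100)                       # theory: should diverge from A

r_conv, p_conv = pearsonr(scale_a, scale_b)  # convergent: expect to reject H0: r = 0
r_disc, p_disc = pearsonr(scale_a, scale_c)  # discriminant: expect NOT to reject H0
print(f"convergent r={r_conv:.2f} (p={p_conv:.3f})")
print(f"discriminant r={r_disc:.2f} (p={p_disc:.3f})")
```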
INTERNAL CONSISTENCY: Cronbach's alpha is the common test. However, Cronbach himself later concluded that Cronbach's alpha is not the most appropriate test of reliability. Cronbach wrote that:
“I no longer regard the alpha formula as the most appropriate way to examine most data. Over the years, my associates and I developed the complex generalizability (G) theory (Cronbach et al., 1963; Cronbach et al., 1972; see also Brennan, 2001; Shavelson and Webb, 1991), which can be simplified to deal specifically with a simple two-way matrix and produce coefficient alpha” (Cronbach, 2004, p. 403). Cited in N. M. Webb, R. J. Shavelson, and E. H. Haertel (2006), “Reliability Coefficients and Generalizability Theory,” Handbook of Statistics, Vol. 26, p. 2. ISSN 0169-7161.
Therefore, the use of Cronbach’s alpha must be reexamined. This is not to say that Cronbach’s alpha is unusable; its use and interpretation, however, must be modified. It is a tool for determining whether a set of responses is consistent; this is different from asking whether the instrument produces consistent responses.
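For readers who still want to compute alpha on their own data, here is a minimal sketch of the standard formula; the function name and the toy data are illustrative only:

```python
# Minimal sketch: Cronbach's alpha from a respondents-by-items matrix.
# Standard formula: alpha = k/(k-1) * (1 - sum(item variances) / var(total score)).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item Likert responses from 4 respondents:
data = np.array([[4, 5, 4, 4, 5],
                 [2, 2, 3, 2, 2],
                 [5, 4, 5, 5, 4],
                 [3, 3, 2, 3, 3]])
print(round(cronbach_alpha(data), 3))
```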
REFERENCES:
[1] Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
[2] Cronbach, L. J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
[3] Lawshe, C.H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563–575.
There is no single best method; in fact, there is enough disagreement that a method proposed by one researcher will be denigrated by another. Personally, I consider the standard approach to be largely an exercise in futility: constructing a variable, testing it via survey research, transforming linguistic responses into exact, precisely spaced numerical values such that five values per question are said to approximate the (uncountably infinite) normal distribution, all within an experimental design that usually involves null hypothesis testing, which is probably more questionable than the standard methods for treating Likert-scale data themselves.
With my biased distaste for NHST and Likert-type data out of the way: the first step you might try is determining whether or not your questionnaire is analogous to similar tests already in use. Generally speaking, for any given variable or set of variables such methods seek to quantify (political orientation, religiosity, personality traits, mental health, etc.), there are related questionnaires that, while not identical, test for at least some of the same things. If such a questionnaire exists, that helps a great deal.
If you haven't constructed, or have only begun to construct, the questionnaire, cognitive interviews are a frequently used approach for developing questionnaires. These are akin to focus groups but use greater structure based on a more developed theoretical framework.
Do many pilot tests. Mechanical Turk is helpful here, and in the early stages just getting friends and colleagues to take a look can be quite useful. You want to make sure that participants understand the questions and are answering what you are asking.
Perhaps the oldest and most popular method (for those who can rely on participants to fill out long questionnaires) for internal validity and consistency is having multiple questions measure the same construct. In pilot testing, if you find divergences among questions intended to measure the same construct, then you have a problem.
How the questionnaire will be administered matters a great deal. For example, online survey software/services offer piping and other response-based navigation that paper questionnaires do not (e.g., you can ensure that if a certain response is given to a question in an online Qualtrics survey, another question is asked, skipped, etc.). Also, questions are viewed singly (or can be), unlike with paper.
It's always good to apply statistical models to data garnered from pilot tests. In particular, certain quantitative methods, such as Item Response Theory (IRT), are designed for this.
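As a rough illustration of the IRT idea, here is a sketch that fits a simple Rasch (one-parameter logistic) model to hypothetical binary pilot data by joint maximum likelihood; for real analyses a dedicated IRT package would be preferable:

```python
# Minimal sketch: hand-rolled Rasch model fit on hypothetical binary data.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

# Rows = respondents, columns = items (1 = correct/agree), purely illustrative.
resp = np.array([[1, 1, 0, 1],
                 [1, 0, 0, 0],
                 [1, 1, 1, 1],
                 [0, 1, 0, 1],
                 [1, 1, 1, 0]])
n_persons, n_items = resp.shape

def neg_log_lik(params):
    theta = params[:n_persons]  # person abilities
    beta = params[n_persons:]   # item difficulties
    p = expit(theta[:, None] - beta[None, :])  # Rasch P(response = 1)
    ll = np.sum(resp * np.log(p) + (1 - resp) * np.log(1 - p))
    # Tiny ridge penalty: anchors the scale (the Rasch model is identified
    # only up to a shift) and keeps perfect-score estimates finite.
    return -ll + 1e-3 * np.sum(params ** 2)

fit = minimize(neg_log_lik, np.zeros(n_persons + n_items), method="BFGS")
print("item difficulties:", np.round(fit.x[n_persons:], 2))
```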
I've attached a chapter on piloting questionnaires you might be interested in.
Andrew provides a very good response. Let me add that you did not provide sufficient information as to what you are trying to validate against. There are several different types of validity, each with a different emphasis.
I am doing something similar at the moment, and am considering these steps:
Checking internal consistency with Cronbach's alpha (for each subscale separately, if you have subscales that don't sum into an overall scale score).
Checking for items that correlate too much with other items (and are therefore possibly redundant).
Using exploratory factor analysis to extract a preliminary factor structure, looking through problematic items again, and then cross-validating your scale using confirmatory factor analysis on a new sample. (A rough code sketch of these steps follows below.)
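A minimal Python sketch of these steps, assuming the items sit in a pandas DataFrame; the column names, the two-subscale split, the simulated data, and the use of sklearn's FactorAnalysis (unrotated) for the exploratory step are my own illustrative choices:

```python
# Minimal sketch: alpha per subscale, redundancy check, exploratory extraction.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 6, size=(200, 6)),
                  columns=[f"item{i}" for i in range(1, 7)])

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

# 1. Internal consistency per (hypothetical) subscale.
print("alpha A:", round(cronbach_alpha(df[["item1", "item2", "item3"]]), 3))
print("alpha B:", round(cronbach_alpha(df[["item4", "item5", "item6"]]), 3))

# 2. Inter-item correlations: values near 1 flag possibly redundant items.
print(df.corr().round(2))

# 3. Exploratory factor extraction (cross-validate later with CFA on new data).
efa = FactorAnalysis(n_components=2).fit(df)
print(pd.DataFrame(efa.components_.T, index=df.columns,
                   columns=["factor1", "factor2"]).round(2))
```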
Dear Paul, your answer is excellent and very didactic. I got the Lawshe reference, but there is a difference between your CVR formula and Lawshe's. Could you explain further or point me to other references so I can understand what is going on?
Can the Cronbach test be used if the questions in a questionnaire have different numbers of options per question? Example: question 1: a), b), c), d), e); question 2: a), b), c), d), e), f), g), h).
Very inspiring discussion. Can anybody suggest which statistical tools to use for pre-testing or pilot testing? So far I have used factor analysis for construct validity, but it requires a large sample. If a pilot test is conducted with the 10-30 participants advised by many of the authors I have read so far, which statistical tool should I use?
What I understand is that pilot testing or pre-testing of a measure is particularly important in establishing the language, understandability, and sensitivity of the instrument. It is usually directed toward finalizing and fine-tuning the instrument, or toward avoiding any major flaws that may have gone unnoticed. I think using a robust construct development process (example: Gillespie A, Reader TW. The Healthcare Complaints Analysis Tool: development and reliability testing of a method for service monitoring and organisational learning. BMJ Quality & Safety. 2016;25(12):937-946. doi:10.1136/bmjqs-2015-004596) will take care of the content validity. Using population groups in construct development and taxonomy development is highly recommended.
I feel that were we to have a method through which we could effectively establish the psychometrics using a small population, we would not need a final study; the pilot would have been sufficient in itself as evidence of construct validity. So far as I know, both factor analysis and Rasch analysis need an adequate sample size. I feel pilot studies are more useful for the reasons I mentioned earlier; however, I would like to hear the opinions of other experts here who might be able to throw more light on the subject.
Morning guys. I did some reading as well and came across Malhotra, Marketing Research, 6th ed., which mentions the same thing: that pilot testing or pretesting is meant to test the wording and the understanding of the prospective respondents. He also suggests that 6-11 respondents from the same population will do the job.
Hi there, can we find an association that is able to participate in a pilot survey study? I need an information assurance association or cybersecurity professionals. Thank you in advance :)
Suresh Kumar, could you provide more information (a citation) about the suggestion that 6-11 respondents from the same population will do the job? Where can I find the source for this?
I think the Rasch measurement model is the best method for validating a questionnaire based on a 5-point Likert scale. It can address all of these: internal consistency, face validity, content validity, and convergent and discriminant validity of a questionnaire.
For questionnaire validation, if resources permit, you should carry out as many validation tests as possible to generate as much evidence as possible of your questionnaire's reliability and validity.