My study, involving 280 respondents, has 3 variables, i.e. independent (4 subsets), mediating (7 subsets) and dependent (9 subsets). Each subset has 3 to 6 questions. If Cronbach's alpha for some subsets falls below 0.7, the usual decision is to delete the relevant question in the subset in order to raise Cronbach's alpha above 0.7. If this action reduces the number of questions to 2, could this present a problem? Is there a solution? During the pilot survey of 40 respondents, this problem did not arise.
The urban legend that an alpha of 0.7 is acceptable for a measurement scale is just that: an urban legend. It is chronicled in a very amusing paper by Lance (Lance CE, Butts MM, Michels LC. The Sources of Four Commonly Reported Cutoff Criteria: What Did They Really Say? Organizational Research Methods. 2006 Apr 1;9(2):202–20. )
Nunnally's 1978 book actually said "what a satisfactory level of reliability is depends on how a measure is being used. In the early stages of research . . . one saves time and energy by working with instruments that have only modest reliability, for which purpose reliabilities of .70 or higher will suffice. . . . In contrast to the standards in basic research, in many applied settings a reliability of .80 is not nearly high enough."
And that makes more sense to me.
Dear Tze Leong Chan:
The minimal number of questions is two, but in that case you take a risk of losing content validity. In particular, I recommend three items or more per subset, and an alpha of 0.8 or more in each subset. In your case, maybe the best way is to remove those subsets completely.
What is the article/book reference for the minimal number of 2 questions? Alternatively, could the Cronbach's alpha possibly be increased in any other way while retaining the same maximum of 3 questions? I have so far entered 99 in the missing-data cells, and there is quite a lot of missing data.
The alpha is a score of the reliability of your measure; it is a characteristic of your scale. So the only valid way to increase the reliability of your measure is to modify your experimental design. You probably have two options: repeat your research with a better experimental design, or report your current data in the best methodological way possible, sacrificing some of the research objectives.
http://books.google.com.mx/books/about/Foundations_of_behavioral_research.html?id=3QQQAQAAIAAJ&redir_esc=y
Specifically, I don't know a reference for that; I am recalling the opinions of my professors in my doctoral program. Maybe you can search in a book on classical test theory.
I agree that dropping to two items probably lowers the content validity of the measure. Alpha in a subsample is probably irrelevant. Reliability (here measured as internal consistency) is largely a characteristic of a measure, although arguably in a population, but not a sample or subsample. What's puzzling is that your first exercise had a smaller N. Could that sample have been more diverse, with regard to what the subscales measure, than the larger sample? That matters because under classical measurement theory the error part is random while the true-score part varies with the level of the trait in the individual. So if the trait is restricted in range, the reliability will decrease, because the proportion that is true score compared to error will decrease. Honestly, the only problem with having a lower reliability is that your statistics will be biased in a particular direction because errors will be larger. So, for example, standard errors will be larger; if there were a t-test, the ratio would be lower than it should be, your p-value would be too large, and you would therefore be less likely to find a "statistically significant" result. Bob
Yes, it is problematic to remove items solely on the basis of their contribution to the height of Cronbach’s alpha (or whatever index of reliability one uses). The result might be (probably will be) a biased test.
Given that Cronbach's alpha is essentially a measure of inter-item correlation, the minimum no. of items has to be two - no reference is needed. Ideally, you want more items not less, so avoid removing items for trivial gains in reliability.
If you are finding that a number of the scales show poor internal consistency and the scales overlap with regard to their theoretical content, then it is probably worth conducting an exploratory factor analysis to examine whether or not your a priori item groupings hold. If they do, then I wouldn't worry too much about the low Cronbach's alpha, as it is probably due to having so few indicators per factor. However, if the structure of the items differs substantially from your a priori groupings, then this is probably the root of the low internal consistency.
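On the point above that alpha is essentially a function of the inter-item correlations, here is a minimal sketch in Python (the column names and the simulated data are hypothetical, purely for illustration) showing how alpha can be computed and how it relates to the average inter-item correlation:

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: columns = items, rows = respondents."""
    items = items.dropna()                      # listwise deletion, for simplicity
    k = items.shape[1]
    sum_item_var = items.var(ddof=1).sum()      # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

def average_interitem_r(items: pd.DataFrame) -> float:
    """Mean off-diagonal inter-item correlation."""
    r = items.dropna().corr().to_numpy()
    k = r.shape[0]
    return (r.sum() - k) / (k * (k - 1))

# Made-up example: 280 respondents, three 1-5 Likert items driven by one latent trait
rng = np.random.default_rng(0)
latent = rng.normal(size=280)
data = pd.DataFrame({f"q{i}": np.clip(np.round(latent + rng.normal(size=280)) + 3, 1, 5)
                     for i in range(1, 4)})

k, rbar = data.shape[1], average_interitem_r(data)
print("alpha:", round(cronbach_alpha(data), 3))
# Standardized alpha (Spearman-Brown form): grows with k even for a modest average r
print("standardized alpha:", round(k * rbar / (1 + (k - 1) * rbar), 3))
```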
I'm concerned about two things: firstly, why were the alphas sufficient in the small sample? That seems contradictory. Secondly, what is your approach to dealing with missing values, and when are subjects removed from your sample? Do some exploratory analyses to find out what is going on before applying solutions.
Dear Tze,
What is the intention of your study? Do you want to design an instrument in order to then answer some field research question, or are you using an established test?
George and Mallery (2003) provide the following rules of thumb for Cronbach's alpha: "> .9 – excellent, > .8 – good, > .7 – acceptable, > .6 – questionable, > .5 – poor, and < .5 – unacceptable" (p. 231). So if your Cronbach's alphas are between .6 and .7, it is not a big problem. Certainly, deleting questions in order to increase Cronbach's alpha carries a risk of losing content validity. In my opinion that is worse, because you can no longer be sure the scale really measures the hypothesized construct. If you use SEM, it could be useful to test the composite reliability index (rho), which according to some authors is preferable to Cronbach's alpha.
When you study Kline's book on structural equation modeling and the requirements for properly specifying and identifying your model, you understand better the issues that arise when you are dealing with a scale that has only two items per subscale, subtheme, or dimension. Yes, this is definitely going to create reliability problems for your subscales, and it would also call into question the precision of your data in your final model (which looks like a complex one, depending on whether you are going to include the subsets in the model individually).
It is not clear to me how you created or developed your subsets of variables, each with 3 to 6 items. From what you are saying, it may look as if the proper procedure was not followed, but then again we may not have enough information. It is not clear whether you created all these subsets of variables or whether they were measures already developed by someone else. There are too many things going on here: how you operationalized your variables, whether they will provide the precise information you need for your model, whether they were properly developed or need to be refined, and whether you followed the proper procedure for developing scales (Robert DeVellis has a good book on scale development). The number of items per subset (as you call it) you need initially is supposed to be larger, knowing that it is highly probable you will need to drop some items that turn out not to work properly.

We do not know what you did with your pilot study of 40: whether you ran at least an exploratory factor analysis (for which 40 participants is not a large enough sample), a Rasch model, or some other analysis that would tell you how your subsets and the items within each subset were behaving in your measure. Running an alpha is not enough when you are developing a measure, if you want to be sure that your constructs are properly measured by your items. Frequently we take for granted that a relatively good reliability gives us validity of the measure, when that does not necessarily happen.

I am sorry, but I have so many questions about your subsets that discussing changes in your model, when I am not sure you have operationalized your variables properly, is secondary. Yes, I believe you are going to have problems in your data; as Bob said, you are introducing too much noise (error) into your model with low reliabilities, and that is without considering the error terms already included in a model by the variables or factors not considered in it. I am not an expert on this subject, but from the little I know I think you have problems, and you may need a statistical consultant to give you direct advice on your study.
Hi Tze,
Here are some topics that may help you in your decisions:
1) Cronbach's alpha is a good measure of the internal consistency of a latent variable, and acceptable values are normally above .70 (Nunnally, 1978). However, we can accept values near .60 (Hair et al., 2006), especially if the factor has only a few items. You can guide your decision by the following: unacceptable < .60; poor .60-.69; acceptable .70-.79; good .80-.89; excellent > .89.
2) It is possible to have only 2 observed variables measuring 1 latent variable, but this situation can lead to problems of model identification later on (Brown, 2006; Kline, 2005). So it is advised to have a minimum set of 3 items, and if possible 4 is better (Hair et al., 2006).
References:
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill Inc.
Hair, J., Black, W., Babin, B., Anderson, R., & Tatham, R. (2006). Multivariate Data Analysis (6th ed.). New Jersey: Pearson Educational, Inc.
Brown, T. (2006). Confirmatory Factor Analysis for Applied Research. New York: The Guilford Press.
Kline, R. (2005). Principles and Practice of Structural Equation Modeling (2nd ed.). New York: The Guilford Press.
I also recommend the following articles:
Worthington, R., & Whittaker, T. (2006). Scale development research: A content analysis and recommendations for best practices. The Counseling Psychologist, 34(6), 806-838.
Hi Tze,
I agree with the idea that two items will not represent the content of the construct. If I am not mistaken, Cronbach's alpha will not increase by adding respondents, but by adding items. Nunnally (1978) and Kaplan and Saccuzzo (2005) also describe ways to increase your Cronbach's alpha, so maybe you need to read these two books. I had the same experience and I did what Kaplan and Saccuzzo suggest to increase the alpha, and it worked.
Good luck with your research.
Regards,
Ayu
If you have a scale of 3 items, and one of the items has an item-total correlation low enough to lower the Cronbach's alpha, I'd say you have a scale of only 2 items anyway... which is not really a scale but a bivariate correlation -- an underidentified construct. The best solution would be to add more items.
Are these theoretically based scales or empirically based scales derived from an exploratory factor analysis? If they are based on a factor analysis, you might want to consider reducing the number of factors extracted and dropping factors that are defined by fewer than 4 items (5 would be even better). Note that a sample of 40 is likely too small to derive the true factor structure.
Note that alpha was actually designed for continuous variables and will be underestimated (attenuated) if the number of levels of the response options is small (e.g., 2, 3, or 4).
Ronán is right on the mark. I suggest you examine the average inter-item correlation - depending on your construct, this value should fall somewhere between .2 and .5 (with larger values for more narrow constructs). Also note that in some cases, a high coefficient alpha is simply indicative of redundant item content.
Also see: Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309-319. doi: 10.1037/1040-3590.7.3.309
Good Luck!
Hi Tze,
As Antonino Callea and Luis Cid have noted, an alpha around .7 is not a big issue, and for research purposes it is in the acceptable or satisfactory range.
However, getting a better reliability estimate in the small sample (N = 40) and a lower value in the larger sample is an issue.
Though many good suggestions have already been given, I would like to share my views on this issue.
1. First you should try to find out why the reliability values go down with the larger sample. As the alpha coefficient is an index of internal consistency, even a single case showing a deviant (inconsistent) pattern of responding may lower the reliability.
Thus, my suggestion is to run a test of multivariate outliers using all the items as predictors (excluding total and sub-total scores); a sketch of such a screen follows this list. I hope that after removing such cases the problem may be solved.
If the problem of low reliability is not solved by this approach, you may also try to identify unusual cases using the SPSS option for this and drop a few unusual cases (if identified). Then re-run the reliability analysis on the slightly reduced data set.
I am very hopeful that after removing outliers and unusual cases the problem will be solved to a large extent.
2. If the problem remains even after trying the aforesaid procedure, then you may try to estimate the correlations between all possible pairs of variables after correcting for the unreliability of the measures (the classical correction for attenuation, also sketched below). In any standard psychometric/psychological testing textbook you can find how to do this. Such methods estimate the correlation between two constructs under the assumption that the measures of the constructs are perfectly reliable.
After correcting the correlations for poor reliability of the measures, you may use this new correlation matrix as an input file for testing your models. For instance, if you are using SEM with AMOS, there is a provision to use a correlation matrix as input.
The following link may be helpful in how to prepare the correlation matrix for the AMOS.
http://amosdevelopment.com/support/faq/enter_sample_correlations.htm
Thus, the model tested with correlations corrected for poor reliability will give an idea of how the constructs would relate if their measures were perfectly reliable.
3. I would not recommend reducing the items further to deal with the problem of low reliability.
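A rough sketch of both suggestions above (the multivariate-outlier screen of point 1 and the correction for attenuation of point 2). The chi-square cutoff and the example numbers are illustrative conventions, not fixed rules:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def mahalanobis_outliers(items: pd.DataFrame, p_cut: float = 0.001) -> pd.Series:
    """Flag respondents whose item-response pattern is a multivariate outlier
    (squared Mahalanobis distance compared against a chi-square cutoff)."""
    complete = items.dropna()
    x = complete.to_numpy(dtype=float)
    diff = x - x.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(x, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared distances
    cutoff = chi2.ppf(1 - p_cut, df=x.shape[1])
    return pd.Series(d2 > cutoff, index=complete.index, name="outlier")

def disattenuated_r(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / np.sqrt(rel_x * rel_y)

# e.g. an observed r of .30 between two subscales with alphas of .60 and .65
print(round(disattenuated_r(0.30, 0.60, 0.65), 2))   # 0.48
```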
Personally I find that Cronbach's alpha is only relevant when you lack a good theory and therefore use a cloud of related questions or pseudo-dimensions instead; you use it to be sure that there are no totally weird and strange questions in the set. In general you should never work without a good theory that determines what you need to ask; in that case Cronbach's alpha is not a relevant test at all.
A couple of quick points to add to the other comments:
An assumption of alpha is unidimensionality, hence the SEM options mentioned above would be an appropriate test of this assumption as well as providing a reliability estimate.
Alpha is a lower-bound estimate of reliability; the SEM options tend to provide higher estimates.
Alpha depends on the number of items: the more items, the higher alpha can become even with moderate inter-item correlations, so a few items tend to produce lower estimates.
An article I have found useful is:
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology.
If your questions use a Likert scale, I suggest that you use an ordinal alpha instead of the traditional Cronbach's alpha (http://pareonline.net/pdf/v17n3.pdf). It might automatically solve part of your problem, because Cronbach's alpha tends to underestimate the reliability with few items and with ordinal scales.
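For what it's worth, once a polychoric correlation matrix has been estimated (e.g., with R's psych package, as discussed in the linked paper), ordinal alpha is simply the standardized alpha formula applied to that matrix. A minimal sketch, with a hypothetical 3-item matrix:

```python
import numpy as np

def alpha_from_corr(R: np.ndarray) -> float:
    """Standardized alpha from a correlation matrix; applied to a polychoric
    correlation matrix this yields the so-called ordinal alpha."""
    k = R.shape[0]
    rbar = (R.sum() - k) / (k * (k - 1))        # average off-diagonal correlation
    return k * rbar / (1 + (k - 1) * rbar)

# Hypothetical polychoric correlations for a 3-item subset
R_poly = np.array([[1.00, 0.52, 0.47],
                   [0.52, 1.00, 0.55],
                   [0.47, 0.55, 1.00]])
print(round(alpha_from_corr(R_poly), 2))        # 0.76
```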
'One-size-fits-all' guidelines about 'good' alpha coefficients are misleading. It depends whether you are trying to finely differentiate your measure from others (a 'high fidelity' measure) or are aiming to capture a wider range of reactions from respondents (a 'high-bandwidth' measure). If the latter, an alpha coefficient in the middle range (say .5-.6) is perfectly respectable - indeed, anything higher may indicate that you are capturing a construct that is too specific for your needs. So it depends on your theoretical aims. Also, there are other criteria that need to be examined, some of the more technical of which are suggested by others above, but also consider construct validity in terms of what other variables your measure correlates with.
Hi Tze, I think I agree with Samuel McAbee that "in some cases, a high coefficient alpha is simply indicative of redundant item content". The alpha score does not indicate that the instrument has high discriminating power, and we also do not really know that the instrument will be suitable for the sample. I suggest you try item analysis with an item response theory approach. It allows you to choose items suitable for the sample and for the aim of the measurement, such as whether it is for diagnosis or for a maximum-performance test (for example, a recruitment test), and so on.
best wishes
Have you ever tried some kind of multivariate data analysis, such as factor analysis, on your data? This way you will be able to observe the unobserved variables that lie behind your actual variables.
I believe that this is the better way to look at this kind of data.
After going through the responses of various esteemed members posted within the last 24 hours or so, I would like to share the following. Before that, I agree with the view of one member that a single common standard cannot be adopted for judging the reliability or validity of measures; it depends on the objectives and goals for which the test is being used.
My other observations are listed below.
1. Some members have suggested that the average inter-item correlation is better than alpha. I would say that although the former is not influenced by the number of items whereas the latter is, alpha is logically the Spearman-Brown prophecy correction applied to the average inter-item correlation, in which the N is the total number of items (the algebra is written out after this post). Since alpha is a function of the average inter-item correlation, the two will not provide discrepant information.
2. Some members also say that a high alpha may indicate redundant items. I agree that this may occur, but only when one has not taken due care during the development of the scale regarding similarity of wording or content of items. Only if one has included several items that ask almost the same thing in different words is it likely to happen. So my remark is that such a situation is less likely to occur if one is using standardized and well-validated measures. However, even with such measures redundancy of items cannot be completely ruled out; to rule it out one may check the item content of such scales.
3. Most importantly, high reliability or validity of a measure does not ensure that it is also highly discriminative. The major goal of any test is to measure individual differences or variations, and this information is not contained in reliability or validity. Reliability tells us that respondents are consistent in responding to the items of the scale over content or time variation. Validity tells us that the items and the scale (as a whole) are sensitive and accurate in measuring what the test claims to measure. But none of this information ensures that the test will also reliably differentiate individuals who vary on the trait being measured.
Thus, my remark is that when one is very concerned about the psychometric properties and adequacy of the measures used in one's research, one may also assess the discrimination ability of the measures using suitable indices. One such index was proposed long ago by Ferguson (for scales with dichotomous items), and this formula has been extended to polytomous scales by one esteemed member of RG. The index is called 'delta'. I created a thread related to one of the papers of this honored member, asking how it can be computed using software or an Excel spreadsheet. The link to this thread appears below.
https://www.researchgate.net/post/Is_there_any_software_or_SPSS_macro_to_compute_delta
Finally, I would say that no common or gold standard can be provided for evaluating the psychometric properties of a measure, as this involves more than the conventional and commonly used indices of reliability and validity. The psychometric quality of a measure depends upon, and is indicated by, several other parameters as well, such as the beta coefficient (the iclust routine in R provides this), the sensitivity and specificity of the measure, differential item functioning, the dimensionality of the measure, etc. However, it would not be practical to apply all the known indices and indicators of psychometric adequacy in a single study. It all depends on the goal and objective of the research which aspect or psychometric property of a scale is most needed and what standard or criteria should be followed to make decisions about the psychometric quality of the measure. For instance, when using diagnostic measures, the trade-off between sensitivity and specificity is an essential component in addition to reliability and validity; in some cases high sensitivity becomes important, while in others specificity is given more emphasis.
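The algebra behind point 1 of the post above, in standard notation ($k$ items, average inter-item correlation $\bar r$, item variances $\sigma^2_{Y_i}$, total-score variance $\sigma^2_X$):

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right),
\qquad
\alpha_{\text{standardized}} = \frac{k\,\bar r}{1 + (k-1)\,\bar r}
\quad \text{(the Spearman--Brown prophecy applied to } \bar r \text{)}.
```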
Many good suggestions. Please note one thing. Because you are testing a mediation, a lower reliability estimate could be a problem. Especially for mediation, you want to use highly reliable tests. Or you may use SEM to test your hypothesis to avoid reliability issues, although measures with lower reliability estimates cannot be saved with any statistics. Reducing the number of items seems to be a better alternative than using all the items with a lower reliability.
Jisoo,
It is a bit counter-intuitive to say that "psychometrically sound items" and low reliability estimates can coexist. If the items in a measure show a coefficient alpha of less than .7, a large share of the observed variance is error. I am curious how such items can be psychometrically sound. Items retained because of content-validity concerns are more likely to lower the psychometric properties of the measure because of the heterogeneity of the items. In the testing of mediation, however, the reliability of the scores should be the primary concern rather than the coverage of the items in the measure.
Although I agree that reliability is not a stable quality of a measure, I disagree that it is a property of a particular sample: it is a property of a particular population. Statistics are estimates from a sample about a population, and Cronbach's alpha is an estimate like any other statistic we ordinarily use. Another sample from the same population might give you a different result; unless the samples are large, it probably will give you a different result. I also don't think that classical test-theory item analyses are really the tell-all about what is psychometrically sound. They are sometimes helpful indicators, but definitely not the last word. A low reliability tells you that you have a lot of measurement error, which under classical theory should be random, but that's pretty much all you know. If it is truly random measurement error, all that's going to do is underestimate your relationships compared with something that contains more true score and less error. Because reliability is easy to compute, we have put way too much emphasis on it and way too little on other qualities of a measure. It's easy to get high reliability by essentially posing the same question several times. I'd rather have a better representation of content with a more modest reliability than a higher reliability and a poor sampling of content. Bob
Hello,
I think this article could help you: Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology.
I agree that reliability is computed for a score (of a measure) obtained from a specific sample, and because of sampling error the estimate of reliability is also likely to vary. However, estimates obtained from a representative and relatively large sample can be generalized to the population; the standard error of measurement may be helpful in making such decisions. Further, the scores obtained through any test or scale are indicators of the construct being measured, and thus the reliability of the scores (of a measure) may be considered an index of the reliability of the scale or measurement instrument. The only problem is the degree to which it can be generalized. I agree that it will never remain stable, as sampling and measurement errors are inevitable. However, if the reliability estimates have been obtained from a large and representative sample, then fluctuations in the estimate across different samples drawn from the same population are likely to be minor, and the estimate may be considered a relatively stable indicator of the reliability of the measurement tool.
If your research is intended to give a general description of the characteristics of a population, and you have a criterion against which to test the validity of your measure, you can take your chances with low reliability. But if your research is evaluative and its results will affect the lives of some subjects, for example those who obtain a score below a certain cutoff, then high reliability is essential, perhaps an alpha of .90 or higher.
There is no problem in reducing the number of questions to 2; you can then try the scale out on 30 or 40 respondents to check whether Cronbach's alpha comes out below or above .70.
A reason we emphasize reliability estimates is that reliability sets the upper limit of the validity of the measure. Validity can be shown in different ways. One of them is through having representative items (i.e., content), as several commentators mentioned. Psychological constructs, and most concepts used in social science, however, tend to have a narrower content area (i.e., narrower bandwidth). This is where coefficient alpha may be most useful: with not many items (as in the original question) and a unidimensional construct. For concepts with lower bandwidth, a content-oriented approach may not provide a good or practical way to show the validity of the measure; other validation approaches (criterion-related or convergent/discriminant) may be more appropriate. So prioritizing content over reliability estimates doesn't make much sense in this situation. Modest to low reliability estimates are not only a matter of lower statistical estimates; reliability should be understood as a precondition for a valid measure. Low reliability estimates mean you are probably not measuring what you wanted to measure.
It may be useful to note that the interpretation of alpha (and reliability in general) depends on a classical measurement setup (multiple items measuring the same "true score" with independent errors). In such a setup it is desirable to design tests with sufficiently high coefficient alpha.
If the construct that is intended to be measured is not of this kind, reliability may be less relevant. If, for example, you expect the items to reflect somewhat different aspects of what you intend to measure, then you would expect reliability to be lower, as the "true scores" of the "latent variable" that each item measures vary. Still the measure may be useful. Imagine something like a checklist, say for used cars. You may have binary items like brakes OK, engine OK, electrics OK, etc. The items may be independent, yet the sum score would be a useful measure of the quality of the car.
For many psychological/QOL/etc. constructs it is not evident why something like a "true score" of some underlying variable with multiple indicators should exist at all. Consequently I would not be too concerned about a suboptimal alpha, as long as there is sufficient reason to believe that the score reflects something useful and proves to be valid.
Andreas,
Please let me clarify. It seems that you are questioning the True Score concept in classical testing theory. Even if you do not assume such a true score, a lower reliability means that your scores can vary depending on situations. It should be noted that alpha is just one reliability estimate that is useful in a certain situation. In other situations, different reliability estimates will be more appropriate. What you described was one of the 'other' situations. And in such a situation, a lower alpha may just reflect that you need to use a different reliability estimate. The situation presented here is where alpha can be most useful.
I share Andreas' concerns. Alpha has no interpretation if the scale is not unidimensional (and is often mistaken for a measure of unidimensionality, which it is certainly not). Some scales, as Andreas points out, measure domains that are multidimensional. In this case, representing someone by their total score reduces the measurement space to a single dimension. Indeed, a high alpha may simply be an indicator that you have tapped a very narrow region of the construct domain. If you take a depression scale like the Beck, it measures domains of mood, cognition, and physical signs such as loss of appetite and sleep disturbance. If you removed all but the mood items, the reliability would apparently increase, simply because you have failed to cover the construct domain adequately.
Ronan,
I don't think there is any disagreement with your argument. But I believe such a concern is not applicable in Tze's situation. He is not talking about a broad construct; his question was about low alphas for subscales that have only 3 to 6 items each.
One more thing to note is that most researchers using multidimensional scales use the wrong version of alpha. When your measure has multiple subscales, composite alpha (treating each subscale as an item; Helms et al., 2006) is the appropriate estimate. Typically, the composite alpha will give you a much lower value than the alpha computed over all items together, because it removes the inflating effect of the large number of items.
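A minimal sketch of that composite alpha computation (the subscale names and simulated data are hypothetical; with purely random data the value will hover near zero, and the point is only to show the mechanics):

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

def composite_alpha(data: pd.DataFrame, groups: dict) -> float:
    """Alpha computed over subscale totals (each subscale treated as one 'item')
    rather than over all individual items pooled together."""
    totals = pd.DataFrame({name: data[cols].sum(axis=1) for name, cols in groups.items()})
    return cronbach_alpha(totals)

# Made-up data: 280 respondents, 9 items, three hypothetical subscales
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.integers(1, 6, size=(280, 9)),
                    columns=[f"q{i}" for i in range(1, 10)])
groups = {"sub_a": ["q1", "q2", "q3"],
          "sub_b": ["q4", "q5", "q6"],
          "sub_c": ["q7", "q8", "q9"]}
print(round(composite_alpha(data, groups), 3))
```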
I agree with Ronan, though my point is not essentially about multidimensionality. My point is that the measure may be intended to capture a composite of some latent variable that is common to all items and of aspects that are unique to each single item. Alpha is (at best) an indicator of the precision with which the common part is measured; in the classical measurement setup everything item-specific is attributed to noise. In a case where you are interested in a composite (in this discussion sometimes called broad bandwidth), alpha may not be interpretable in the classical sense, and lower values may be acceptable depending on the proportion of shared vs. item-specific variance that is part of the "signal".
In the best of all worlds it would be wise to find replicate items that share the item-specific component, so that they may be used as indicators within some hierarchical factor structure.
Whether Tze actually has independent subscales (which would indeed strengthen a more classical interpretation, as it restricts the amount of item-specific variance that may be part of the signal) is not clear. To me it sounds as if he may have 3 sets of items, intended to be used in a mediator analysis, which may be self-constructed. In that case such considerations about the nature of the construct and the adequacy of applying classical test theory may be useful, especially if he, as I assume, wants to use the data to answer content questions and not for psychometric fine-tuning. Thus he may be bound to the items he has, which may be of such a composite nature but may still be useful measures of the intended constructs.
A problem in this situation is, of course, that he then would lack a useful estimate of measurement precision.
Simon,
Rereading the initial question, you may be right. The variables/subsets/questions structure is not entirely clear.
Tze, could you maybe give examples for the variables, subsets, questions structure, including a description of the answering format?
Do you maybe have 4+7+9 variables represented by 3 to 6 items each, which, in your study, may have the roles of independent, mediator, dependent variables?
Hello
When I read the question and responses just now (sorry, a little late), I wanted to know about factor analysis and whether the subscales were identified with that method. Then I wondered whether the KMO measure of sampling adequacy was satisfactory. Also, I agree with the comments that an internal consistency alpha on a 3-item subscale is fine for a construct; in fact, I would wonder whether the items were too strongly inter-correlated if it were much higher.
The concerns with latent structure and validity are crucial and can highlight limitations in what we think we are measuring with an index such as "reliability." Most indices are ad hoc and make strict assumptions. I find the following paper informative:
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74(1), 107-120.
In addition to the other responses, remember that Cronbach's alpha is really a calculation of how consistently a set of items is answered by a group of people. Why this group did or did not consistently respond to the set of items can be due to multiple factors. A high reliability can be obtained when the set of items is consistently answered but for the wrong reasons: response sets may affect scores. For example, people may consistently answer a set of Likert-scale items ranging from 1 to 5 with 3 (or 5, etc.) without even reading the items. The Cronbach's alpha will be high, but not for a good reason. A low Cronbach's alpha does mean that the group of people did not respond to that set of items consistently. What meaning we ascribe to such a metric comes from our theory and what we believe about it. While alpha does not measure heterogeneity of a construct, it can be affected by it: it may be that a subgroup responded differently to the items than other subgroups. All we really know is that this group of people did not respond consistently to this set of items. Other measures of reliability/consistency may be more crucial to the construct being measured; perhaps stability over time (test-retest reliability) is more theoretically meaningful.
In CTT "the value of a reliability estimate tells us the proportion of variability in the measure attributable to the true score. A reliability of .5 means that about half of the variance of the observed score is attributable to truth and half is attributable to error. A reliability of .8 means the variability is about 80% true ability and 20% error. And so on." (http://www.socialresearchmethods.net/kb/reliablt.php) . How much of this is going to be acceptable in your context? It relates directly to the standard error of measurement of the tool. "A satisfactory level of reliability depends on how a measure is being used. In the early stages of predictive of construct validation research, time and energy can be saves using instruments that have only models reliability, e.g. .70." (Nunnally & Bernstein, Psychometric Theory 3rd ed, p 264). In some cases lower reliability be be warranted in others not so - in applied setting a guideline is generally that tests used in admissions or selection decisions should have reliabilities =/> .90 with =/> .80 are desirable in most other tests. However as a researcher you have additional wiggle room.
In any event, the questions are: How are you using the measure (clinical or research setting)? What consistency measure is most meaningful theoretically? How does this one piece of information fit into the full array of information you have about the tool (psychometric context)?
Some other resources:
- Various other texts on Psychometric Theory such as Furr & Bacharach (2008), Psychometrics: An introduction.
- http://testing.wisc.edu/Reliability.pdf
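As a quick sketch of the link between reliability and the standard error of measurement mentioned above (the SD of 4 score points is just an illustrative number):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: observed-score SD times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# e.g. a subscale whose total score has SD = 4 points
for rel in (0.60, 0.70, 0.80, 0.90):
    print(f"reliability {rel:.2f} -> SEM = {sem(4, rel):.2f}")
```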
I agree with Dale Pietrzak that alpha is an indicator of average inter-item response consistency and that it may be influenced by a host of factors, some of which may not be desirable for a given piece of research. I would add that this applies to other forms of reliability indices too. Further, sometimes a very high reliability index is itself an indicator of problems. For instance, in the case of alpha (as Dale Pietrzak has pointed out), a response set (e.g., central tendency) may produce a high alpha, but in that case it is a poor indicator of the proportion of true variance in the total variance. Similarly, a measure with very similar item content is also likely to yield a very high alpha, but there it indicates redundancy of the item content rather than the reliability of the measure.
Even if we use another form of reliability (say test-retest), a higher value may sometimes be undesirable. For instance, if one has developed a measure of a state-dependent phenomenon that by definition is likely to vary from moment to moment, then a low test-retest reliability is a desirable outcome. If the test-retest reliability in this case turns out to be very high, it calls into question the validity of the measure (and perhaps of the construct as well).
Earlier, I shared my views on this issue and the ways we can handle the situation. In those posts I discussed the generalizability of the alpha coefficient, the use of other psychometric indices such as the discrimination index, adjusting the correlation between two measures for poor reliability, and more, which I hope may be helpful in dealing with the issues raised in this thread.
Before any answer makes sense, we need to know why you are interested in Cronbach's coefficient alpha. Are you using this measure as an estimate of reliability? If so, it will only be an estimate of reliability defined as the proportion of the observed variance that is attributable to variance in true scores, under the assumption that the observed score is the sum of a true score and random error. Under these conditions, Cronbach's coefficient alpha can be treated as a lower bound of test-retest reliability. This, as I recall, is the basis for many arguing that it should be at least 0.7. However, Cronbach's coefficient alpha is in itself a measure of the internal consistency of the items in your test. Two items can give you a coefficient value of 0.9, for example; this would be the lower bound of test-retest reliability for the two-item test, if the classical true-score model holds. Note, however, that this says nothing about the validity of the test.
While this is certainly morphing the conversation somewhat, as Rakesh indicates, it is important to understand which reliability you need to look at and why its value is what it is. As we have covered various elements of internal consistency, test-retest might add to the discussion. A high test-retest reliability is only desirable if the construct being examined is theoretically stable. If one is looking at daily fluctuation in some aspect of affect, then by definition it should vary over a day, and it would be questionable to have a high one-month test-retest reliability, as that does not fit the construct. On the other hand, if you are examining mood and it is theoretically pretty stable over a month's time, it would be good to have a high test-retest reliability. "Good" for any form of reliability is construct-dependent.
In terms of which to look at, it depends on your use. Within CTT you use it to calculate your standard error of measurement. If your use is at the time of testing (concurrent diagnostic issues), internal consistency may be your most informative index, while if you are trying to predict something in the future, test-retest may be more informative for the question. The guidelines are very useful, but as with all guidelines there is a need to understand the context.
Having only two questions would likely be problematic. It is recommended in Hair et al.'s Multivariate Data Analysis: A Global Perspective (2010) that you have "a minimum of three items per factor, preferably four, to not only provide minimum coverage of the constructs theoretical domain, but also to provide adequate identification for the construct."
In relation to the Cronbach's alpha value, however, Ronán Conroy's point makes a lot of sense. I think it depends on the value you're getting; again, Hair et al. say that "values of .60 to .70 [are] deemed the lower limit of acceptability", and as far as I recall Kline's Handbook of Psychological Testing presents a case that values lower than 0.7 for Cronbach's alpha are acceptable, although I don't have it to hand.
The general guideline for the number of items in a scale depends somewhat (though not entirely) on the method used for test construction. In CTT, which is where internal consistency of this kind is typically used, the number given as a guide is often in the 10-16 range; see the Nunnally et al. reference cited before. In latent trait/IRT models this number may be smaller. As I recall, Hair et al. does not really have a section on psychometrics or test construction; I would guess they are discussing confirmatory factor analysis in the SEM chapter. I think that is where the discussion about three measures per factor took place, but it has been a while since I taught from the text.
The three items per measure in SEM comes from a number of considerations. First, to obtain a sufficiently broad conceptual basis for the latent factor, a number of items is required.
Second, three items is the minimum to obtain a statistically identified solution in a one-factor congeneric analysis; 4 are required to have at least one free degree of freedom.
Two items are possible in a multi-factor solution.
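The identification arithmetic behind these points, for a single-factor (congeneric) model with $p$ indicators and the factor variance fixed to 1 (so the free parameters are $p$ loadings plus $p$ error variances), is roughly:

```latex
\mathrm{df} \;=\; \underbrace{\frac{p(p+1)}{2}}_{\text{observed (co)variances}} \;-\; \underbrace{2p}_{\text{free parameters}}
\qquad\Longrightarrow\qquad
p=2:\ -1\ \text{(under-identified on its own)},\quad
p=3:\ 0\ \text{(just identified)},\quad
p=4:\ 2 .
```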
I accept Robert Brooks' point; you should review the multiple factors and monitor them.
In SEM these are not usually referred to as items; they are typically called indicators.
No existing method of finding reliability is isomorphic to the definition of reliability; thus, no existing method can find reliability. What they find is somewhat different from reliability. There is a need to find reliability as per its definition, preferably from a single administration of the test. I am currently working on this. Anybody working in this direction may please contact me.
S.N.Chakrabartty, Galgotias Business School, Greater Noida, India
In addition to the above answers I propose to look at the article I wrote for some other measures and procedures to work with validity, reliability and consistency of measurement instruments. Most appropriate in your situation, I guess, is the use of categorical principal component analysis and nonparametric IRT. The CatPCA will provide you detailed information about the measurement of the response categories of each item in relation to the other items and the items as a whole given all the scores of the respondents. Nonparametric IRT will provide you detailed information about the structure in your dataset whatever your theoretical construct is.
Haspeslagh, M., Eeckloo, K., & Delesie, L. (2012) Validation of a new concept: Aptitudes of psychiatric nurses caring for depressed patients. Journal of Research in Nursing, 17(5), 438-452.
@ Satyebdra, would you please share the definition of reliability you are working with? Psychometrically (as per CTT), it is defined as the proportion of true variance, or 1 minus the proportion of error variance. Statistically, both are equivalent if one assumes that the total variance is the sum of true and error variance.
If you are also following this definition, then I would say that there are methods to estimate reliability as so defined. I recall that in Guilford's "Psychometric Methods" some such methods are discussed: one is an ANOVA-based approach and another is based on the difference of two sets of scores.
4th Apr, 2013
Azubuike Victor Chukwuka
National Environmental Standards and Regulations Enforcement Agency (NESREA)
According to de Vaus (2002), the smaller the number of items, the greater the likelihood that a reliability analysis using Cronbach's alpha will be inaccurate.
5th May, 2013
Imam Salehudin
The University of Queensland
I use Cronbach's alpha strictly for testing instrument reliability in the pre-test phase ONLY.
When you have collected a significant number of responses, it is better to estimate the measurement errors directly using confirmatory factor analysis on the final data set. Item reliability should be calculated as 1 - (estimated error); a cut-off point of 0.7 works for me. Construct reliability should also be calculated using THE formula (I don't know how to write it here).
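Presumably the formula meant here is the usual composite (construct) reliability index; with standardized loadings $\lambda_i$ and error variances $\theta_i = 1 - \lambda_i^2$ for the $k$ indicators of a construct, it is commonly written as:

```latex
\rho_c \;=\; \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^{2}}
                 {\left(\sum_{i=1}^{k}\lambda_i\right)^{2} \;+\; \sum_{i=1}^{k}\theta_i } .
```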
5th May, 2013
Rakesh Pandey
Banaras Hindu University
Dear Imam
You can write the description of the method, including the formula and steps, in a Word file and attach it to your comment. I hope it will be helpful to those following this thread.
5th May, 2013
Imam Salehudin
The University of Queensland
@ Rakesh: Thanks for the suggestion. I'm attaching the formula here:
8th Aug, 2013
Azubuike Victor Chukwuka
National Environmental Standards and Regulations Enforcement Agency (NESREA)
If the number of questions is reduced to two in order to raise the alpha above 0.7, it may not really be a big problem if your sample size is considerably large. Alphas below 0.7 are often advised to be interpreted with caution.
8th Aug, 2013
Don Bacon
University of Denver
I'm not too concerned with reliabilities below .70, as long as the sample size is large. Low reliability reduces statistical power (see Bacon 2004 below). If you have null findings, reviewers are more likely to see the low reliabilities as a possible problem.
That being said, some reviewers just have to have .70 or they reject the paper. Imam Salehudin's suggestion of CFA is a good one. In CFA, it may be appropriate to use other measures of reliability that take into account the different loadings of the measures. These other measures may be higher than Cronbach's alpha (see Bacon, 1995).
Bacon, D.R. (2004). The contributions of reliability and pretests to effective assessment. Practical Assessment, Research & Evaluation, 9(3). Available from http://PAREonline.net/getvn.asp?v=9&n=3.
Bacon, D.R., Sauer, P., & Young, M. (1995). Composite reliability in structural equations modeling. Educational and Psychological Measurement, 55(3), 394-406.
8th Aug, 2013
Alberto Mirisola
Università degli Studi di Palermo
As commented previously, Cronbach's alpha has several problems. An alternative would be to compute the more robust omega index.
Take a look here for a good paper on this topic:
Dunn, T., Baguley, T., & Brunsden, V. (2013, in press). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology.
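For readers who want to try omega: R packages such as psych (Revelle) implement it directly, and the computation from a one-factor solution is simple enough to sketch by hand. A minimal illustration with hypothetical standardized loadings:

```python
import numpy as np

def omega_total(loadings) -> float:
    """McDonald's omega (total) from standardized loadings of a one-factor solution."""
    lam = np.asarray(loadings, dtype=float)
    uniq = 1 - lam ** 2                      # unique variances of standardized items
    return lam.sum() ** 2 / (lam.sum() ** 2 + uniq.sum())

# Hypothetical standardized loadings for a 4-item subscale
print(round(omega_total([0.75, 0.70, 0.65, 0.55]), 2))   # 0.76
```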
8th Aug, 2013
Don Bacon
University of Denver
The use of coefficient alpha or coefficient omega depends in part on how you are using the scale, and if you are using structural equation modeling. Omega may be most appropriate in structural equation modeling, but not in regression. See the Bacon, Sauer, Young (1995) cite in my previous post.
9th Sep, 2013
Gregor Socan
University of Ljubljana
@Alberto: I do not agree that omega is more robust than alpha. Omega requires a good fit of the (one-)factor model, while alpha requires only uncorrelated errors when interpreted as a lower bound. But more importantly, the choice between alpha and omega depends on the psychometric paradigm you work with: classical test theory -> alpha; latent trait theory -> omega.
12th Dec, 2013
Gregor Socan
University of Ljubljana
I don't understand the statement that “omega is more realistic than alpha”. In what sense realistic? In my view the two coefficients are quite different animals. Omega is based on a latent trait approach, and it would give a precise reliability estimate in the case of perfect fit of the congeneric model (that is, a one-factor model plus zero specific variance, which is never achieved in practice of course, and cannot really be tested anyway unless you have longitudinal data). Alpha is based on CTT (which is a model-free approach, a tautology one may say) and is a lower bound to reliability, subject to some mild assumptions. So one might claim that alpha is more realistic than omega, because it is not based on a quite restrictive model.
About Guttman's lambda-2: it has been well known at least since Lord and Novick (1968) that it is always higher than lambda-3 (= alpha), but in my experience the difference is usually marginal unless there is a substantial number of negative covariances. In such circumstances, though, it is indeed a better choice than alpha.
Alpha has several appealing properties (apart from simple derivation and calculation): its sampling behavior is relatively well understood, its sampling bias is negligible, and it bears a close relationship to intraclass correlations and generalizability theory.
All in all, I do not think that alpha has many real shortcomings when chosen and used properly. We should just not forget that it is *a* measure of reliability, not *the* measure of reliability.
12th Dec, 2013
Luis Fernando Díaz Vilela
Universidad de La Laguna
Revelle has an illustrative comparison of internal consistency estimates of reliability in https://personality-project.org/r/book/Chapter7.pdf. On pages 227 and 228 in particular you'll find a clear view of the meanings of alpha and the omegas (hierarchical and total).
Following this, omega is more realistic because the relative variation between hierarchical and total gives you an idea of the underlying structure: high hierarchical and low total when only one component is present; low hierarchical and high total when several factors arise.
I clearly prefer the omegas over alpha, but with just two or three variables it looks like "hunting flies with cannons". In this case I guess it is better to report the coefficients of determination, or R squared.
12th Dec, 2013
Ronán Michael Conroy
Royal College of Surgeons in Ireland
Luis: this book is a fascinating and invaluable resource. Many thanks! I still think, however, that these are measures of the performance of a test score in a specific sample under particular test conditions. There is a real lack of research into the generalisability of these estimates, whether they be alpha or omega or something in between…
12th Dec, 2013
Gregor Socan
University of Ljubljana
@Luis: From what you say about omega, it still does not follow that omega is more realistic than alpha (realistic in the sense of representing the true state of reality). It only follows that omega may give you some additional insights into the data – of course, alpha cannot and should not be used in the sense of estimating the fit of a unidimensional model. On the other hand, I see no reason to investigate dimensionality by comparing variants of omega. After all, omega is based on factor analysis, and inspection of the FA results will clearly tell you whether the one-factor hypothesis is tenable or not.
But I think we are discussing the wrong issue here. Questions like “Is omega better than alpha?” make sense only if you consider CTT to be a special case of the linear factor model (like McDonald, 1999). On the other hand, if you consider these two approaches to be fundamentally different (from the philosophical, formal, and practical perspective; see for instance Borsboom, 2009), such questions are just a case of mixing apples and oranges. Although both viewpoints are IMO legitimate, I personally subscribe to the latter view and therefore see little sense in comparing both coefficients in an evaluative manner.
2nd Feb, 2014
Christian Gaden Jensen
University of Copenhagen
Excellent discussion, guys, thank you. I am validating a new verbal memory test with 8 items in each subscale (positive, negative, neutral words, respectively). We have tested 137 healthy people for the first paper. How would you recommend looking at the internal consistency of each scale - since it only has 8 'items'? Thanks.
2nd Feb, 2014
Kenneth A Wallston
Vanderbilt University
Christian: What are the items that make up each of your sub-scales? Are they words? Phrases? Statements? Because it's a memory test, I assume that the sub-scale scores will be the number of items of each type that are remembered "correctly." Is there any logical or theoretical reason why this should be internally consistent? Not all multi-item tests need to be internally consistent. For example, checklists of health-related behaviors seldom are internally consistent, yet they can be reliable using other metrics (such as test-retest stability) and they can certainly be valid without being internally consistent.
2nd Feb, 2014
Christian Gaden Jensen
University of Copenhagen
Thank you, Kenneth. The 24 items (8 positive, 8 negative, 8 neutral) are single, monosyllabic words (e.g., 'leg' for neutral, 'kill' for negative, 'kiss' for positive). The list is shown 5 consecutive times with immediate recall after each presentation. Then a distraction list is presented and recalled, after which the participant is asked to recall the target list again without seeing it. Other tests are performed for 30 minutes, after which another target-list recall trial is conducted. This administration is identical to the one used in Rey's Auditory Verbal Learning Test (RAVLT), perhaps the most-used test of non-affective verbal list learning. I would absolutely love to hear examples of tests that do not need to demonstrate internal consistency in their subscales (sub-performances), and why this is so. Thank you, Kenneth!
2nd Feb, 2014
Cameron Paul Hurst
Queensland Institute of Medical Research
@Ronan. I TOTALLY agree with you, BUT
99% of reviewers don't. So how does one get an alpha of 0.65 past the reviewer, ESPECIALLY in the psychology discipline, an area with arguably the most dogmatic reviewers of all the disciplines in which I have published?
It is one thing to point out the urban legend (and again I totally agree with your comment); I hope you are equally pragmatic when it comes to many of the other magic numbers, e.g. Cohen's kappa of 0.7, ICC = 0.7, RMSEA < 0.06 (or 0.05), GFI, CFI, TLI > 0.9 (or > 0.95), and the many other fit measures used in SEM and CFA... Our problem is how do we get around these fanatical reviewers who just won't budge?
Should I even mention p