In structural equation modeling there are many goodness-of-fit indices, such as GFI, NFI, etc. I am searching for an up-to-date and widely accepted discussion on interpreting those indices, and on which indices to prefer when reporting results. I am using LISREL.
Thank you very much for your well-articulated response. As you'll see, we seem to agree on the fundamental points - so our disagreement on the more specific aspects may be due, as most disagreements are, to different perspectives on the role of model misfit.
You say that one can cheat by lowering the sample size. Well, yes: a cleanly fitting model based on a small sample provides only weak evidence. As I said, a fitting model is no proof that the causal structure is correct; only a significant chi-square implies that the model is wrong somehow. Having said that, there are small-sample corrections such as the Swain correction:
Herzog, W., & Boomsma, A. (2009). Small-sample robust estimators of noncentrality-based and incremental model fit. Structural Equation Modeling, 16, 1-27.
Here is an easy-to-implement R function into which you enter the chi-square (and, if you like, the fit indices) to obtain small-sample-corrected values:
http://www.ppsw.rug.nl/~boomsma/swain.pdf
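Sketched in R, the mechanics look roughly like this. The correction factor below is the Swain factor as I recall it from Herzog & Boomsma (2009), and the function name is only illustrative - please check everything against the linked code before relying on it:

```r
# Sketch of the Swain small-sample correction of the ML chi-square.
# chisq = ML chi-square, df = model degrees of freedom,
# N = sample size, p = number of observed variables.
swain_corrected_chisq <- function(chisq, df, N, p) {
  n <- N - 1
  # q = "effective" number of variables implied by the number of free parameters
  q <- (sqrt(1 + 4 * p * (p + 1) - 8 * df) - 1) / 2
  s <- 1 - (p * (2 * p^2 + 3 * p - 1) - q * (2 * q^2 + 3 * q - 1)) / (12 * df * n)
  corrected <- s * chisq
  c(swain_factor = s,
    corrected_chisq = corrected,
    corrected_p = pchisq(corrected, df, lower.tail = FALSE))
}

# Hypothetical example: chi-square = 52.3 on 24 df, N = 80, 9 observed variables
swain_corrected_chisq(52.3, df = 24, N = 80, p = 9)
```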
Beyond that, the goal should not be to "get a non-significant chi-square" but to TEST one's model and to learn whether it is wrong. You said that improvement of theories and models is essential, and I couldn't agree more. However, improvement starts with the recognition that something has to be improved :)
Your misconception is that we have some statistic, index, or other quantitative measure showing that a model is "less wrong than the models we currently use". All we can do is listen to the alarm bell. After respecting this alarm, it is up to us again to re-think our models. And just to repeat: the fit indices do not measure "degree of correctness".
I have two comments on your mentioning Copernicus and Newton. First, you stretch our topic (causal models and their implications) to cover scientific models in general. I experience something similar when discussants bring up city maps to argue that models are always wrong because city maps are just abstractions and by definition not true. However, this is no counter-argument to the notion of strictly testing causal models. Causal models are very specific in that they are sets of claims which are either true or not. These claims refer to two things:
a) The variables measured by the indicators in the model exist and have the meaning defined by the construct definitions and
b) the structure of proposed effects and non-effects is correct.
Whereas theories of the Copernicus or Newton type are very broad and are paradigms rather than models, causal models are sets of concrete claims to be tested.
When you say that all models are wrong and that "p-value based tests that treat model fit as the null hypothesis simply reflect whether there was sufficient power to reject the imperfect model", you take a very skeptical and surprisingly weak stance even toward your own models. That is: you do not trust a researcher's causal knowledge to formulate hypotheses about existing and non-existing effects? And if you are right and at least one restriction is always wrong: shouldn't we want to know that this is the case rather than ignore information about it? You wrote about improving models: fit indices lead to accepting a wrong model as good. That does not lead to improvement but rather to the conservation of wrong models.
Second: yes, wrong models can be useful, but, seriously, this cannot really be a justification for ignoring evidence about their wrongness. If your wrong model leads to practical applications that are successful, this happens rather by accident. When I go to a doctor or fly in an airplane, I hope that the causal model underlying the doctor's or the mechanic's work is sound and correct.
You stress the "prediction of data". Here, you falsely equate the match between the implied and empirical covariances with "prediction". The point in SEM is that the causal structure implies a structure in the data (unconditional dependencies/covariances and conditional independencies). These empirical patterns are vital for judging whether the structure is correct. A common factor model, for example, is not there simply to predict covariances (why should that be interesting in itself?). Rather, the covariances and the conditional independencies (i.e., stochastic independence of the indicators given the factor) are vital for judging whether this causal structure is correct. A common factor model can "fairly well predict the covariance pattern" (as is done in exploratory factor analysis) but may be complete nonsense. And if the measurement model is nonsense, the factor is a statistical artifact, and consequently so are all parameters involving this factor (e.g., structural effects). A disaster. The devil is in the details.
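To spell out what "implies a structure in the data" means in the simplest case, here is the textbook single-factor algebra (nothing beyond standard SEM results):

```latex
% Single common factor \xi with indicators x_1, ..., x_p:
x_i = \lambda_i \xi + \varepsilon_i, \qquad \mathrm{Var}(\xi) = \phi, \qquad
\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \ (i \neq j)
% Implied covariance structure:
\sigma_{ij} = \lambda_i \lambda_j \phi \ (i \neq j), \qquad
\sigma_{ii} = \lambda_i^2 \phi + \theta_{ii}
% With four or more indicators this implies testable restrictions such as the vanishing tetrad
\sigma_{12}\sigma_{34} - \sigma_{13}\sigma_{24} = 0
```

It is these restrictions - not the mere reproduction of covariances - that the chi-square test checks.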
Beyond that, I cannot believe that prediction of variables (R-squared) is more important than the correctness of causal claims. If this were true, why do you set up experiments with randomized groups in medical science? Go ahead and let the sample self-select, which opens the door to self-selection and confounding; still, simply predicting the outcome with the dummy (treatment/control) should give a nice R-squared. I hope you forgive me this polemic ;) No, you don't do that, because the tested drug or medical intervention could have no effect or even make things worse, and you want to know its effect.
I hope that we agree on these points:
a) many CAUSAL models may be wrong, and science should try to find out which models are wrong, where they are wrong, and how to improve them
b) fitting models based on wrong or bad data (including small samples) are not really evidence for the proposed structure
c) prediction is important, but the essence of science is to know WHY things predict other things (i.e., to know the causal mechanisms)
d) there is no proof of the causal structure, only corroboration/support.
Once I started to take the chi-square test more seriously, I had many models that led to EUREKA experiences when I took a look at question wordings, re-thought the meaning of the variables (apart from vague "constructs"), and examined model diagnostics (e.g., standardized residuals). I learned so much from taking a closer look. The chi-square is not destructive - it is the start of a learning process :)
The first answer is that you can choose whichever fit indices support your point of view. This is not the methodologically soundest way to go about things, but I think it is done frequently, and reviewers often have no idea.
The better answer is that commonly used indices include CFI, TLI, RMSEA, and SRMR. A recommendation by Hu and Bentler is that 0.95 or higher be used as the cutoff for CFI and TLI. For RMSEA, a value close to (preferably lower than) 0.06 is an appropriate cutoff. For SRMR, a value of 0.08 or lower is the appropriate cutoff.
Reference:
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3(4), 424-453.
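As a side note (not specific to LISREL): here is a minimal sketch of how these indices can be obtained alongside the chi-square in the free R package lavaan, using the Holzinger-Swineford example data that ships with the package; any fitted model object would do.

```r
library(lavaan)

# Classic three-factor CFA on the example data shipped with lavaan
hs_model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
fit <- cfa(hs_model, data = HolzingerSwineford1939)

# Chi-square test plus the approximate fit indices discussed above,
# to be judged against the cutoffs mentioned (CFI/TLI >= .95, RMSEA <= .06, SRMR <= .08)
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))
```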
Now, as for which fit indices to report, that is a slightly different issue. The above study recommends the use of both
1. SRMR
and
2. One of the following: CFI, TLI, or RMSEA
In full disclosure, the above reference studies other fit indices as well, but I have focused here on the most commonly reported ones. In general, if you choose a fit index that is not one of these common indices, you risk raising the question of why that is. You may have a good reason, but there may be suspicion that you are cherry-picking an index in order to find one that supports your model.
Having said all of this, it is arguable that there are no hard and fast rules about cutoffs. Hu and Bentler themselves caution about this. You can also see in the references of the above paper that more lenient fit criteria have previously been suggested. Some people still go by the more lenient criteria, but as far as I know these more recent numbers are the most up to date.
The other problem worth mentioning here is that some fit indices are sample-size sensitive. For example, an SEM model usually has an associated chi-square (upon which most fit criteria are based).
The significance of the chi-square is itself a fit statistic, but it is one that is often not relied upon, because model fit is treated as the null hypothesis (i.e., it penalizes larger samples and will usually reject models fitted to samples of more than a few hundred).
RMSEA, on the other hand, tends to penalize small samples, so it is well suited to large SEM studies (and ideally studies are large rather than small). CFI is relatively sample-size independent, which gives it some flexibility.
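To make the sample-size dependence concrete, these are the standard maximum-likelihood formulas (as usually written; some programs use N instead of N - 1):

```latex
% The ML test statistic scales the minimized discrepancy by the sample size:
T = (N-1)\,\hat{F}_{\mathrm{ML}}, \qquad T \sim \chi^2_{df} \ \text{if the model is exactly correct}
% RMSEA divides the excess of T over its degrees of freedom by (N-1),
% which is why small samples inflate it and large samples shrink it:
\mathrm{RMSEA} = \sqrt{\max\!\left(\frac{T - df}{df\,(N-1)},\, 0\right)}
```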
This is only a small taste of how to choose fit indices, but it will hopefully get you started. A website by David Kenny:
http://davidakenny.net/cm/causalm.htm
has some additional information (and some additional references). The journal Structural Equation Modeling has also published a large number of papers on this topic if you really want to drill down into the details.
Thank you very much for your informative and comprehensive answer. One point: I use LISREL, which does not report a TLI value. Are CFI, RMSEA, and SRMR alone enough?
Just for the record, and with all due respect, I would like to present a contrasting perspective. In particular, I would like to make a plea for using the chi-square test for model evaluation.
a) Fit indices like the RMSEA and its relatives were designed to measure the degree of causal misspecification, and that is how they are interpreted: a model with a better fit index is preferred over a model with a worse one, because it is perceived as "more correct" / closer to the truth.
The problem is that the amount of difference between the model-implied covariance matrix and the empirical covariance matrix is erroneously equated with the degree of causal misspecification - "the closer the two matrices, the more correct the model".
And this is unfortunately wrong: wrong models can show very close matrices, and overall correct models with minor causal errors can imply huge differences when those errors happen to sit in spots with high impact on the implied matrix. Conclusion: whenever a model shows a "data problem", this may indicate a serious "model problem".
b) The reason why fit indices were invented was simply marketing. Here is a nice anecdote from Dag Sörbom, the co-developer of LISREL:
"A telling anecdote in this regard comes from Dag Sorböm, a long-time collaborator of Karl Joreskög, one of the key pioneers of SEM and creator of the LISREL software package. In recounting a LISREL workshop that he jointly gave with Joreskög in 1985, Sorböm notes that: ‘‘In his lecture Karl would say that the Chi-square is all you really need. One participant then asked ‘Why have you then added GFI [goodness-of-fit index, i.e., an approximate fit index]?’ Whereupon Karl answered ‘Well, users threaten us saying they would stop using LISREL if it always produces such large Chi-squares. So we had to invent something to make people happy. GFI serves that purpose’ (McIntosh, 2012, p. 10)’’.
McIntosh, C. (2012). Improving the evaluation of model fit in confirmatory factor analysis: A commentary on Gundy, C.M., Fayers, P.M., Groenvold, M., Petersen, M. Aa., Scott, N.W., Sprangers, M.A.J., Velikov, G., Aaronson, N.K. (2011). Comparing higher-order models for the EORTC QLQ-C30. Quality of Life Research, 21(9), 1619-1621.
c) From the beginning, there have been some strange conceptions of the chi-square test:
c1) "The chi-square test is sensitive to sample size" -- This statement is expressed as if sample-size sensitivity were a misbehavior of the chi-square test. This is nonsense. The chi-square test is a statistical test, and like any statistical test its power MUST increase with sample size.
c2) "The chi-square test (in large samples) reflects trivial causal errors" -- Well, yes, in large samples even (theoretically) trivial errors - *those that are consistent and systematic* (i.e., beyond random sampling error) - are detected. Again, this is exactly what the test should do, namely evaluate whether discrepancies can be attributed to sampling error or to something beyond it. HOWEVER: this cannot be turned around into "because my sample is large and the chi-square is significant, the error is trivial" (a conversion error). That is, a significant chi-square may reflect something theoretically trivial OR point to a serious problem. The simple conclusion is: investigate it.
In a nutshell: with observational data we face an inherent problem in drawing causal conclusions, because there are many alternative models. By relying on fit indices we accept many wrong models as good, and this damages science. Because these problems are inherent, we should do everything we can to rule out wrong models. The chi-square test is valuable in exactly this regard, because it reports that SOMETHING is wrong and should be investigated - nothing more! Sometimes diagnostics and reconsideration of the theory lead to a surprise and a real learning outcome - and sometimes we continue despite a significant test.
Improving a model, however, requires knowledge about possible reasons for misfit. Pearl's work on d-separation or Wright's path tracing rules may be a good starting point (google them).
The popular simulations, e.g., by Hu & Bentler, used models with a very narrow focus - for instance, factor models with wrong loadings - but our models may be wrong in other regards, most importantly through completely misspecified factor structures. See the article by Hayduk, which should enlarge the horizon of possibilities:
Hayduk, L. A. (in press). Seeing perfectly fitting factor models that are causally misspecified: Understanding that close-fitting models can be worse. Educational and Psychological Measurement.
Again, the chi-square test may sometimes point to unimportant stuff, but often it will point to something serious. Finding a cleanly fitting model does not prove the correctness of the causal assumptions, but such a model has passed a high barrier, which provides supporting/corroborating evidence. Relying on fit indices, in contrast, weakens our research.
Perhaps give the chi-square view a chance - you may be convinced:
Hayduk, L. A., Cummings, G. G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing! testing! one, two, three - Testing the theory in structural equation models! Personality and Individual Differences, 42(5), 841-850.
Hayduk, L. A., Cummings, G. G., Stratkotter, R. F., Nimmo, M., Grygoryev, K., Dosman, D., et al. (2003). Pearl's d-separation: One more step into causal thinking. Structural Equation Modeling, 10(2), 289-311.
Hayduk, L. A., & Pazderka-Robinson, H. (2007). Fighting to understand the world causally: Three battles connected to the causal implications of structural equation models. In W. Outhwaite & S. Turner (Eds.), Sage Handbook of Social Science Methodology (pp. 147-171). London: Sage Publications.
In the meantime, this view is gaining more and more proponents, as reflected in some recent textbooks and top-tier journal articles:
Kline, R. B. (2011). Principles and practice of structural equation modeling (Third ed.). New York, London: The Guilford Press.
Shipley, B. (2000). Cause and correlation in Biology: A user's guide to path analysis, structural equations and causal inference. Cambridge UK: Cambridge University Press.
Mulaik, S. A. (2009). Linear Causal Modeling with Structural Equations. Boca Raton: Chapman & Hall.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21, 1086-1120.
McIntosh, C. (2007). Rethinking fit assessment in structural equation modeling: A commentary and elaboration on Barrett (2007). Personality and Individual Differences, 42(5), 859-867.
McIntosh, C. (2012). Improving the evaluation of model fit in confirmatory factor analysis: A commentary on Gundy, C.M., Fayers, P.M., Groenvold, M., Petersen, M. Aa., Scott, N.W., Sprangers, M.A.J., Velikov, G., Aaronson, N.K. (2011). Comparing higher-order models for the EORTC QLQ-C30. Quality of Life Research, 21(9), 1619-1621.
I hope you will forgive me this long note and allow me to have presented a contrasting point of view :)
Jöreskog, K., & Sörbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Lincolnwood: Scientific Software International.
Hair, J. F., Jr., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7th ed.). New Jersey: Pearson Prentice Hall.
These will give you the cut-off points for CFI, NFI, NNFI, GFI, AGFI, RMSEA, and the other fit parameters. As already pointed out by our colleagues in this thread, the discussion of cut-off points for the goodness-of-fit indices is extensive, so these references could be helpful, since they were written with the software you are using in mind. Regarding the parameters, besides those mentioned above, I would advise you also to report the normed chi-square (the chi-square divided by its degrees of freedom); in this way you will be reporting absolute, incremental, and parsimony fit indices.
What a fascinating discussion this question has prompted. In particular, I find Holger's retelling of Dag Sörbom's anecdote about Jöreskog thrilling. However, if you are planning to submit your papers to American journals, I would strongly encourage you to consult The Reviewer's Guide to Quantitative Methods in the Social Sciences by Hancock and Mueller. Particularly relevant to guiding the selection of fit indices are chapters 8, 28, and 29. In addition, Byrne (2006) as well as Finney and DiStefano (2013) warn that researchers need to be mindful of departures from normality when selecting fit indices, and that some adjustments may be needed, particularly when comparing alternative models. I would also add that Brown (2004), Geiser (2013), and Byrne (2006) strongly advise estimating 90% confidence intervals for the RMSEA. While LISREL and EQS were my favorite SEM programs, I find Mplus more powerful and up to date than LISREL and EQS in estimating indicators of fit under a range of conditions (e.g., lack of normality, non-continuous variables, ordinal variables, handling of missing cases, imputation, Monte Carlo simulations).
a) Every software package offers ways to correct the chi-square test for non-normality (e.g., the Satorra-Bentler or Yuan-Bentler corrections, or even DWLS estimation if you like); I would particularly recommend the free package lavaan for R (www.lavaan.org). A short lavaan sketch follows below.
b) Deviations from multivariate normality are very seldom the reason for p-values far below .01 (e.g., p < .000001), which researchers often obtain (me too ;) ). I only mention this because I often see the non-normality issue being used to explain away a significant chi-square value :)
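Here is a minimal lavaan sketch of point (a); the example data ship with the package, and the deliberately simple one-factor model is just a placeholder for whatever model you are actually testing:

```r
library(lavaan)

# Placeholder one-factor model using four of the Holzinger-Swineford indicators
model <- ' f =~ x1 + x2 + x3 + x4 '

# Satorra-Bentler scaled (mean-adjusted) chi-square, robust to non-normality
fit_mlm <- cfa(model, data = HolzingerSwineford1939, estimator = "MLM")

# Yuan-Bentler robust chi-square; can be combined with missing = "ml" for incomplete data
fit_mlr <- cfa(model, data = HolzingerSwineford1939, estimator = "MLR")

# The summary reports both the standard and the scaled/robust test statistics
summary(fit_mlm, fit.measures = TRUE)
```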
Holger makes good points, but I would like to place them in perspective and raise a few that have not been mentioned. I previously quoted others, but will now, admittedly, give my own, possibly incorrect, opinions.
The chi-square test basically attempts to find a statistically significant difference between the observed data and the data implied by the model. As such, it attempts to reject model fit by demonstrating a statistically significant difference between implied and observed covariances. The easiest way to achieve model fit based on the chi-square test is therefore to test the model in a small sample.
First, a short and sweet objection: any statistical test that is better able to accept a model when that model is tested in a small sample rather than a large one is problematic. In general, science considers more observations better.
Now for my more long-winded objection:
Holger's argument rests on a premise:
We don't want to accept wrong models as scientific fact.
It may at first seem that such a premise is without doubt the goal of all science, but this is not true. It might be an academic goal of science, but practically speaking, what science actually does is develop new models that are simply less wrong than the models we currently use.
In general, science advances by establishing wrong models (that still explain the data to some degree) as fact and subsequently improving on them. The ancient and early medieval astronomers had a system for mapping heavenly objects composed of epicycles and deferents - was it wrong? Completely wrong, yet they could still use it to predict the motion of heavenly bodies with accuracy sufficient for calendars and navigation. Copernicus developed a new, completely wrong theory that kept the system of epicycles and deferents but simplified it by placing the sun at the center. Newton's "laws" of physics, which ostensibly finally explained why heavenly bodies move the way they do, are in fact a wrong model of how the universe works. That said, this wrong model is so good that we continue to use it in day-to-day applications knowing that it is wrong. Likewise, even the theory of relativity as envisioned by Einstein is wrong in that it did not allow for the probabilistic phenomena of quantum mechanics.
I am not sure that a theoretical model in science has ever been developed that we are satisfied is actually true - models in science are simply good enough to be accepted as true until someone obtains new data that the model cannot explain or finds a better way to explain previously observed data. On this view of science, p-value-based tests that treat model fit as the null hypothesis simply reflect whether there was sufficient power to reject the imperfect model.
In short, science is about looking at how well a model explains the data and developing models that explain the data better. This is not a question answered by chi-square tests. Now, arguably, the chi-square test could be modified to reflect degree of fit rather than statistically significant misfit. Oh wait! This is exactly what the commonly used fit indices are.
Finally, let me suggest an alternative to p-value-based testing that, as far as I can tell, nobody has developed for SEM but that has been developed in clinical medicine. I will use an analogy to explain it: in medicine, we sometimes want to test the equivalence of two treatments. If we simply showed no statistically significant difference between the two treatments, we would suffer from the same problem I have just described above. Instead, we set a priori limits, on substantive grounds, for what we would consider clinically no different. We then do a p-value-based test for the difference between the two treatments, and if the 95% CI for this difference lies within the a priori established limits of no meaningful difference, we call the two treatments clinically equivalent (regardless of whether a statistically significant difference can be found).
I think that the chi-square (or some p-value/significance-based test) would be very useful if we took this approach: we set a priori limits for how much the model-implied values may differ from the observed data and then see how much our model actually deviates. We put confidence intervals on this deviation, and if the confidence interval of the model deviation lies within the a priori established window of no meaningful difference, then we accept the model. (If I am wrong and this does exist, please let me know.) Larger sample sizes tend to shrink confidence intervals, making them better able to fit within the a priori defined window of acceptable deviance, so this approach would not give an advantage to models tested in small samples.
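Just to restate that proposal in symbols (delta is the hypothetical a priori margin, Delta whatever discrepancy measure one chooses - this is only a rewrite of the paragraph above, not an established procedure):

```latex
% Choose an a priori equivalence margin \delta on substantive grounds.
% Let \Delta be the discrepancy between implied and observed structure
% (in the medical analogy: the difference between two treatments).
\text{Declare acceptable fit iff } \;
\mathrm{CI}_{1-\alpha}\!\left(\hat{\Delta}\right) \subseteq [0, \delta]
% (for a treatment difference the window would be (-\delta, \delta)).
% Because the CI shrinks as N grows, larger samples make it easier, not harder, to pass.
```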
Çınar, what type of analyses are you planning on running? The fit indices you choose should depend on what you are trying to assess (e.g., structural relationships, measurement equivalence/invariance, etc.). Below are two readings that I found particularly useful.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238-246.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9(2), 233-255.
Just a few key articles that are rather conservative when it comes to goodness of fit - great sources of information:
Bentler, P. M., & Chou, C. P. (1987) Practical issues in structural modeling. Sociological Methods & Research, 16, 78-117.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness-of-fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-600.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.