Research methodologists have identified serious problems with the use of "control variables" (aka nuisance variables, covariates), especially in survey research. Among the problems are uninterpretable parameter estimates, erroneous inferences, irreplicable results, and other barriers to scientific progress. For the sake of discussion, I propose that we stop using control variables altogether. Instead, any variable in a study should be treated like the others. If it is important enough to include in the research, it is important enough to be included in your theory or model. Thoughts for or against this proposition?
For me, without control there is no experimentation. If you are conducting a true experiment testing a cause-effect relationship, control is very necessary; otherwise there will be alternative explanations. The result you obtain may not be due to the manipulation of your independent variable but rather to confounding variables. Including all secondary variables in your research doesn't make much sense.
There have been huge developments in this area in terms of concepts and techniques in the last ten years (building on some very old ideas) that go well beyond the simplistic notion of 'control' variables; they are succinctly and very well summarised in Holger Steinmetz's reply to this question:
https://www.researchgate.net/post/Is_there_any_different_with_moderate_variable_and_control_variable_in_SEM
"there are five types of variables in any causal system (besides your target independent variable X and outcome Y):......"
I find http://dagitty.net/ helpful in thinking through these issues.
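To make this concrete, here is a minimal sketch using the dagitty R package (the DAG itself is a made-up example for illustration, not one from this thread): Z confounds X --> Y, M mediates it, and dagitty reports which variables must be adjusted for.
-------------
# Minimal dagitty sketch; the DAG below is hypothetical
library(dagitty)  # install.packages("dagitty")

g <- dagitty("dag {
  X -> Y
  X -> M
  M -> Y
  Z -> X
  Z -> Y
}")

# Which adjustment sets identify the total effect of X on Y?
adjustmentSets(g, exposure = "X", outcome = "Y", effect = "total")
# returns { Z }: adjust for the confounder Z, but not for the mediator M
-------------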
I am sure such DAG-based approaches will have considerable influence on survey research, helping to tackle the issues raised in the original post; for example, see
Preprint: Graphical Causal Models for Survey Inference
Hello all
(thanks, Kelvyn Jones)
Thomas E. Becker I saw that you have an ORM paper that seems to rightly complain about the lack of theory when deciding on control variables. Hence, instead of "deleting the idea," I would propose coming up with a "causal identification strategy."
With regard to the problems you referred to when considering controls, I remember that especially psychologists (sorry for the bashing, I am one myself) always confuse three levels:
1) measures vs. true attributes
2) relationships vs. effects
3) samples vs. population
That is, thinking about controls should deal with 1) the true attributes (and not the potentially invalid/biased measures), 2) the causal role of the control in the system (i.e., as a confounder or part of a confounding path), and 3) the population (and not a potentially biased sample).
I only had time to skim your paper, but I noticed a paragraph at the end where you cited some work (e.g., by Paul Spector) about problems. As mentioned above, my hypothesis would be that these concern points 1-3 above. :)
All the best,
Holger
The main problem I see with deleting control variables is what I was taught to call "specification error": omitting any third variable that is correlated with both the independent variable and the dependent variable biases the estimate of the regression coefficient linking the IV and DV.
So, even if one should always prefer to have a theoretical justification for including a variable in the analysis, it is still better to be safe than sorry.
Hi David,
thinking about the specification error in terms of a variable being merely "correlated with both the independent and the dependent variable" is part of the problem, as "correlated" can mean that
a) this variable is a mediator between the IV and the DV
b) this variable is a common cause of both
c) this variable is a common outcome of both
You have to control for this variable in case b, and you MUST NOT control for it in cases a and c. Controlling in scenario "a" means you kill the indirect effect (unless you deliberately want to do this to exclude a path competing with your hypothesis). Controlling in scenario "c" means you get a collider bias. A similar case is controlling for an outcome of Y.
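A quick simulation sketch makes scenario c concrete (variable names and effect sizes are purely illustrative): X and Y are independent, yet conditioning on their common outcome manufactures an association.
-------------
# Collider bias sketch: X and Y are unrelated; C is their common outcome
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- rnorm(n)            # true effect of x on y is zero
c <- x + y + rnorm(n)    # collider, caused by both

round(coef(lm(y ~ x))["x"], 2)       # ~  0.00 (correct)
round(coef(lm(y ~ x + c))["x"], 2)   # ~ -0.50 (spurious collider bias)
-------------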
So you need some form of assumption about the role of the variable and I agree that when you are very unsure (and unable to form an assumption), you can do some kind of sensitivity analysis.
Best,
Holger
What I meant to cover was the simplest case, where there is one independent variable, one dependent variable, and one omitted variable. In this case, the omission of a variable that is correlated with both the independent and dependent variable will, by definition, affect the partial regression coefficient between them. How this turns into direct and indirect effects goes beyond what I had in mind, but if the original correlation (total effect) is mis-specified, then it is hard for me to see how further partial coefficients will be correct.
Hello David,
yes, the partial correlation will be affected by adjusting for either a mediator, confounder, or collider. However, a *correlation* cannot be misspecified (beyond departures from linearity, of course); it is just... well... a correlation. Pearl talks nicely about this in the chapter of his "Book of Why" on Karl Pearson, who--being a causality-phobic positivist--had to apply some immense mental gymnastics to talk about "spurious correlations" without ever thinking about cause and effect. What an achievement in coping with cognitive dissonance. There is no "spurious correlation". A correlation is just a relationship that may be created by dozens of underlying causal processes.
A "specification error", in contrast, involves causal assumptions. This is where control variables come into play (and again, I am not talking about errors in specifying the form of the relationship, which are a different kind of specification error).
By the way: what happened to the creator of this discussion? Thomas E. Becker are you still there? I hope you are not one of the folks who start a thread and never come back (which unfortunately happens often on RG).
:)
All the best to all.
Holger
Hi David again,
I later stumbled across the following part of your sentence, which left me confused about whether I had misinterpreted what you had in mind:
"...if the original correlation (total effect) is mis-specified, then it is hard for me to see how further partial coefficients will be correct."
First, a step forward would be to stop equating correlation with "total effect".
Second, the total *effect* of a variable will be biased if you
1) fail to adjust for existing common causes/confounders, or
2) erroneously control for a non-theorized mediator, or
3) erroneously control for a common outcome/collider.
This has nothing to do with any intention to estimate direct or indirect effects. It has only to do with a) failing to do the right thing (controlling for confounders) and b) doing the wrong thing (controlling for mediators/colliders).
Hence, the "(total effect) is misspecified" if you do not control for confounders, and it will be unbiased if you do so. Hence the partial coefficient will be unbiased.
All the best
Holger
It seems like we agree that effects will not be properly estimated (i.e., mis-specified or biased) if one does not control for confounding variables.
Thanks, all, for your thoughtful comments. One serious problem with controlling for so-called confounding variables is that it typically makes tests of the related hypothesis meaningless. Take the simplest case, where a scientist hypothesizes that X causes Y. In analyzing the corresponding data, the scientist decides to control for Z because he or she thinks Z can also cause Y. But testing the hypothesis that X causes Y while controlling for Z is a different test than the hypothesis that X causes Y. The coefficient for the partialled X may be statistically significant while that for X is not, and vice versa. There are numerous other issues with statistical control, but I wanted to mention this one because it gets at the heart of the problem. Note, too, that unethical researchers could use statistical control as a means of fishing for significance (a type of HARKing). But that's another story . . .
Hello Thomas,
a few comments
1) You say "...the scientist decides to control for Z because he or she thinks Z can also cause Y"
I would say that this is an insufficient reason to control for Z, because if Z does not ALSO cause X, ignoring Z will do no harm. Controlling for Z may have benefits, mainly for efficiency/SE/power, but ignoring it won't bias the effect.
2) "But testing the hypothesis that X causes Y while controlling for Z is a different test than the hypothesis that X causes Y."
If the substantive hypothesis is "X causes Y" and Z is indeed a common cause, then only controlling for Z will test the substantive hypothesis. If the hypothesis is wrong and there is only confounding (but no X-Y effect), then you will get a false positive and conclude an effect where there is none.
3) "The coefficient for the partialled X may be statistically significant while that for X is not, and vice versa"
Yes, and depending on the true data-generating process, one of the two significance tests is meaningful and the other is not. Just note that you make the mistake I complained about in my earlier posting, namely mixing causal identification issues with statistical issues. Considering modeling issues (including the selection of controls) is a purely theoretical issue concerning the population (not an empirical issue concerning the sample data).
4) "Note, too, that unethical researchers could use statistical control as a means of fishing for significance (a type of HARKing)"
This is true, and in fact an argument IN FAVOR of theoretically explaining the role of potential controls, as this narrows the window for "tinkering".
All the best,
Holger
Hi Holger,
Let me stay in the role of an advocate for deleting control variables. My replies to your recent comments are below in CAPS:
1) You say "...the scientist decides to control for Z because he or she thinks Z can also cause Y"
I would say that this is an insufficient reason to control for Z, because if Z does not ALSO cause X, ignoring Z will do no harm. Controlling for Z may have benefits, mainly for efficiency/SE/power, but ignoring it won't bias the effect.
I AGREE!
2) "But testing the hypothesis that X causes Y while controlling for Z is a different test than the hypothesis that X causes Y."
If the substantive hypothesis is "X causes Y" and Z is indeed a common cause, then only controlling for Z will test the substantive hypothesis. If the hypothesis is wrong and there is only confounding (but no X-Y effect), then you will get a false positive and conclude an effect where there is none.
I CAN'T AGREE. THESE ARE FUNDAMENTALLY DIFFERENT HYPOTHESES, CONCEPTUALLY AND MATHEMATICALLY. THERE MUST BE CORRESPONDENCE BETWEEN THE HYPOTHESIS AND THE ANALYSIS USED TO TEST IT. IF THE HYPOTHESIS SPECIFIES A CONTROL VARIABLE (CV), THEN SO SHOULD THE ANALYSIS. IF NOT, NOT.
3) "The coefficient for the partialled X may be statistically significant while that for X is not, and vice versa"
Yes, and depending on the true data-generating process, one of the two significance tests is meaningful and the other is not. Just note that you make the mistake I complained about in my earlier posting, namely mixing causal identification issues with statistical issues. Considering modeling issues (including the selection of controls) is a purely theoretical issue concerning the population (not an empirical issue concerning the sample data).
I DON'T BELIEVE IT IS A MISTAKE TO ALIGN THE ANALYSIS WITH THE HYPOTHESIS. IF YOUR HYPOTHESIS (MODEL) IS X --> Y BUT YOUR ANALYSIS IS X --> Y WHILE CONTROLLING FOR Z, YOU HAVE NOT TESTED X --> Y.
4) "Note, too, that unethical researchers could use statistical control as a means of fishing for significance (a type of HARKing)"
This is true, and in fact an argument IN FAVOR of theoretically explaining the role of potential controls, as this narrows the window for "tinkering".
WE MAY BE TALKING ABOUT TWO DIFFERENT ISSUES. WHAT I AM SAYING IS THAT IF A SCIENTIST IS INTENT ON FINDING SOME - ANY - "SIGNIFICANT" RELATIONSHIP, AN EASY WAY TO DO THIS IS TO THROW CVS INTO THE ANALYSIS. IF CV1 DOESN'T PRODUCE A SIGNIFICANT X --> Y RESULT (I.E., A SIGNIFICANT X COEFFICIENT), TRY CV2; AND SO ON UNTIL, BY CHANCE, YOU FIND SOMETHING YOU THINK IS PUBLISHABLE.
WE MIGHT CALL THIS A PROCESS OF PSEUDO-SUPPRESSION, WHEREBY Z (THE CV) HAPPENS BY CHANCE TO BE MORE HIGHLY CORRELATED WITH X THAN WITH Y, MAKING IT APPEAR AS THOUGH X AND Y ARE RELATED WHEN THEY ARE NOT.
STATISTICALLY, THIS IS AN INSTANCE OF P-HACKING. THAT IS, YOU'RE CONDUCTING MULTIPLE TESTS WITH DIFFERENT CVS UNTIL YOU FIND ONE WHERE P < .05.
IF YOU THEN PRETEND LIKE YOU MEANT TO CONTROL FOR Z FROM THE BEGINNING, THIS ALSO BECOMES A CASE OF HARKING.
IN SUM, THE USE OF CVS, IN THIS SENSE, DOES NOT NARROW THE WINDOW FOR TINKERING. IT EXPANDS IT. THE THEORETICAL EXPLANATION IN THIS CASE FOLLOWS THE ANALYSIS (HARKING), IT DOES NOT PRECEDE IT. THIS IS ONE OF SEVERAL REASONS FOR DELETING CVS ALTOGETHER.
Hi Thomas,
first, perhaps it would be possible to choose another style of responding. Using caps is usually regarded as shouting, and it's hard to read :) Perhaps using > or something would be possible.
Second, I cannot follow your talk about different hypotheses. In the end, these are simply your claims without any stated reasons. But actually, it is not my point anyway whether to regard these as different hypotheses. My simple point is that if you have a (one) substantive hypothesis and fail to control for confounders, you don't properly test this hypothesis because the estimate and test are biased.
Imagine your substantive hypothesis is "ice cream sales lead to death by drowning", based on your rationale that ice cream stresses the digestive system and thus reduces muscular blood flow. Then, according to your plea, you would test this hypothesis without controlling for the obvious confounder "season". As a result, you will get a significant effect which is either completely spurious or strongly biased. In my view, you would have failed to test the hypothesis and drawn wrong conclusions about the dangerous consumption of ice cream.
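To see this in numbers, here is a rough simulation sketch of the scenario (all coefficients are invented, and "season" is reduced to a single warmth variable for simplicity):
-------------
# Season causes both ice cream sales and drownings; ice cream does nothing
set.seed(1)
n <- 1e4
season   <- rnorm(n)                  # warmth of the period
icecream <- 0.7 * season + rnorm(n)   # warm weather -> more ice cream sold
drowning <- 0.7 * season + rnorm(n)   # warm weather -> more swimming deaths

round(coef(lm(drowning ~ icecream))["icecream"], 2)           # ~ 0.33, spurious
round(coef(lm(drowning ~ icecream + season))["icecream"], 2)  # ~ 0.00, adjusted
-------------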
If you, however (and perhaps this is the deeper cause of our disagreement), fully adopt an empiricist stance, then a correlation (without controlling for anything) is all you want. But again, I would argue that a "positive" test will not support your initial substantive hypothesis.
With regard to p-hacking, I can only repeat that an explicitly stated identification strategy will tremendously reduce the opportunity to tinker with control variables, as any reviewer and reader can judge the reasonableness of including a control variable: only confounder candidates will make sense, and a tinkered, just-by-chance control will most often not meet any rationale.
So, if you think that the hypotheses "x-->y with control of z" and "x-->y *without* control of z" are different, then I don't care and you can have that. I would stay with "x-->y" as the *substantive hypothesis*, and an explicitly stated identification strategy laying out the reasons to include z belongs to the choice of the overall design to properly test *this* hypothesis.
All the best,
Holger
I just saw that my response got very long. To spare you the time of responding to this pamphlet :), a simple question: So you think we should test hypotheses like the ice cream hypothesis without controlling for anything? A correlation table and that's all? What would be the result for the scientific field?
Because this is the gist of the whole discussion....
All the best,
Holger
Thanks for your replies, Holger, you have given me food for thought. Please see my responses in quotes (not in CAPS - I didn't mean to scream) following excerpts of your points.
**Second, I cannot follow your talk about different hypotheses. In the end, these are simply your claims without any stated reasons. But actually, it is not my point anyway whether to regard these as different hypotheses. My simple point is that if you have a (one) substantive hypothesis and fail to control for confounders, you don't properly test this hypothesis because the estimate and test are biased.
Imagine your substantive hypothesis is "ice cream sales lead to death by drowning" . . . Then, according to your plea, you would test this hypothesis without controlling for the obvious confounder "season". As a result, you will get a significant effect which is either completely spurious or strongly biased. In my view, you would have failed to test the hypothesis and drawn wrong conclusions about the dangerous consumption of ice cream.**
"With respect, I don't believe my statements about Y = f(X) and Y = f(X,Z) are simply claims. Conceptually and mathematically, they are objectively different. The hypothesis that X causes Y is different from the hypothesis that X causes Y while controlling for Z."
"Your ice cream example illustrates some of the problems with control variables (CVs). In a regression equation, controlling for season would mean allowing season to correlate both with the amount of ice cream consumed and with deaths by drowning. But is season an antecedent of ice cream consumption (e.g., people eat more in the summer than in the winter), or a moderator (the effect of ice cream on deaths depends on season)? These are different models, and treating season as a CV is, at best, vague and, at worst, misleading."
"If season is important enough to include in the study, then its role in the model should be specified (e.g., antecedent or moderator), and should be reflected in the hypothesis. 'Ice cream causes death by drowning' is not the same as 'People eat more ice cream in the summer, and this causes death by drowning,' which is not the same as 'Ice cream causes death by drowning, but this effect is more pronounced in the summer.'"
"Note that if the model is misspecified, the inclusion of season in a given role could be spurious, as could be the effect of ice cream. However, the spuriousness is not necessarily a result of failing to include season - it could be due to including season, or misspecifying its role."
"Here are a few articles that expand on these points:
Carlson, K. D., & Wu, J. (2012). The illusion of statistical control: Control variable practice in management research. Organizational Research Methods, 15, 413-435.
Spector, P. E., & Brannick, M. T. (2011). Methodological urban legends: The misuse of statistical control variables. Organizational Research Methods, 14, 287-305.
Spector, P. E., Zapf, D., Chen, P. Y., & Frese, M. (2000). Why negative affectivity should not be controlled in job stress research: Don't throw out the baby with the bath water. Journal of Organizational Behavior, 21, 79-95."
**With regard to p-hacking, I can only repeat that an explicitly stated identification strategy will tremendously reduce the opportunity to tinker with control variables, as any reviewer and reader can judge the reasonableness of including a control variable: only confounder candidates will make sense, and a tinkered, just-by-chance control will most often not meet any rationale.**
"The problem is that scientists are smart. Those determined to cheat are often capable of creating post-hoc explanations for CVs that, by chance, produce p < .05. Here's a simple example. Let's say that my hypothesis is that ice cream consumption causes death by drowning. During the study I collect data on 20 demographic variables, including gender, race, age, education, and 16 others. One of these variables will, by chance, almost certainly be significantly more strongly correlated with ice cream consumption than with death."
"Let's say that variable is gender. I want to include gender in the equation because it will serve as a pseudo-suppressor, making it appear that ice cream and death are related when, in fact, they are not."
"All I need now is a seemingly plausible reason for controlling for gender. I do a little online research of male/female differences, and I discover a study that suggests that men are more reckless than women. In presenting this post hoc explanation in the paper, I say something like, "Prior research has demonstrated that men are more reckless than women. Therefore, because swimming after eating ice cream may be a reckless act, we controlled for gender."
"Sounds plausible, right? But it is utterly dishonest and disguises the desire to include gender simply because it produces p < .05 for the partial coefficient for ice cream consumption."
Holger, regarding your more recent message, you said: "So you think we should test hypotheses like the ice cream hypothesis without controlling for anything? A correlation table and that's all? What would be the result for the scientific field?
Because this is the gist of the whole discussion...."
No, I'm not suggesting that a correlation table will suffice in all cases. However, a correlation that actually tests the hypothesis in question is better than a regression coefficient that does not.
The analyses should be as simple or as complicated as necessary to test the hypothesis or model in question. If you have the simple hypothesis that X is associated with Y, then a correlation between X and Y is the proper test. If the hypothesis is that X --> Y, then an experimental study needs to be conducted. At the end of the experiment, you can do a simple ANOVA or calculate the correlation between X and Y to see if there is evidence that X causes Y.
If you have a more complicated model, you might use structural equation modeling to test the pattern of coefficients and fit of the model to the data.
These are just examples. My point is that deleting CVs from our scientific vocabulary does not imply that only correlations are allowed. Rather, it is that any deductive analyses should be congruent with the corresponding hypothesis or model - and that the role of each relevant variable be specified (rather than relegated to the vague role of control variable).
Hi David Morgan! Sorry to be slow responding to your most recent comment. Allow me to continue playing the role of champion of the "delete CVs" argument.
I cannot agree that effects will necessarily be misspecified or biased if one does not control for confounding variables. Misspecification would occur if one models a variable, Z, as an antecedent when, in fact, it is a moderator (or vice versa). It would also occur if an antecedent or moderator is modeled as a "control variable" (CV).
In regression, a CV is a variable that is usually allowed to correlate with the other predictors and the criterion. Conceptually and mathematically, this is entirely different from how one models an antecedent or moderator (e.g., in structural equation modeling - SEM). One serious criticism of CVs is that their relationships with other variables are vague. This is most evident in SEM, where authors sometimes allow a CV to correlate with just the DV, sometimes with just one or more IVs, and sometimes . . . well, you name it.
So, due to vagueness and the often unpredictable effects of CVs, they are more likely to result in misspecification than they are to remedy it.
For more on the topic, see the articles I suggested to Holger. And here's one more:
Becker, T. E., Atinc, G., Breaugh, J. A., Carlson, K. D., Edwards, J. R., & Spector, P. E. (2016). Statistical control in correlational studies: Ten essential recommendations for organizational researchers. Journal of Organizational Behavior, 37, 157-167.
Thanks for your input into the conversation.
Hello Thomas et al.
again, I welcome this interesting and relevant discussion. I also welcome the new citation style and would add a further stylistic element: I will use "-------------" to frame your text.
First, the notation you used and things you later wrote ("if the hypothesis is that x is related to y") again support my suspicion that you talk about a hypothesis as an expected pattern in the data, whereas my only concern is the causal claim underlying this hypothesis, which is the conclusion of the hypothesis-development section of a paper. So--as I said--I have no problem anymore with your statement that "x-with-control" and "x-without-control" are different patterns and thus different hypotheses. I could even agree if your proposal were to specify f(X,Z) as part of the hypothesis instead of abandoning Z, as this comes closest to the recommendation in the causal inference literature to specify the identification strategy. I cannot understand this (to me, wrong) choice.
Now let's come to the further issues where I think you have some false impressions. Sorry that this is such a burden (it will get long).
With regard to the ice cream example you wrote
-------------
"Your ice cream example illustrates some of the problems with control variables (CVs). In a regression equation, controlling for season would mean allowing season to correlate both with the amount of ice cream consumed and with deaths by drowning. But is season an antecedent of ice cream consumption (e.g., people eat more in the summer than in the winter), or a moderator (the effect of ice cream on deaths depends on season)? These are different models, and treating season as a CV is, at best, vague and, at worst, misleading."
-------------
The decision to control for season follows from the assumption that season causes ice cream consumption as well as death by drowning, as this is the reason for the ice-death correlation being spurious. Hence, it should obviously correlate (in the population) with both. Otherwise, the causal assumption that underlies the selection of the control would be wrong. "Allowing to correlate" hence sounds weird. Of course it correlates; this is the point.
Second, a variable (e.g., "Z") can *both* be a confounder as well as a moderator. These are different *functions* of Z and not necessarily *models* as you have different model choices:
a) to ignore Z completely (resulting in bias)
b) just control for it without modeling the interaction (which MAY have biasing effects that go beyond regarding the "main effect" of X simply as the average across the levels of Z; see
Winship, C., & Elwert, F. (2010). Effect heterogeneity and bias in main-effects-only regression models. In R. Dechter, H. Geffner, & J. Y. Halpern (Eds.), Heuristics, Probability and Causality: A Tribute to Judea Pearl. College Publications.
c) to test a full moderation model (which controls for Z and thus eliminates the spurious/confounded part of the X-Y relationship and addresses the effect heterogeneity). A more interesting example would be when Z is a common OUTCOME of X and Y and a moderator--as this would imply that any interaction model (inherently controlling for Z) would bias the effect. This is indeed a headache-causing scenario.
It is funny that you mention the "bath water paper" (M. Frese was my doctoral advisor), as this triggered me at the time and was one building block for my current view (especially my claim that psychologists don't think causally and always mix correlations/effects, measures/entities, and sample/population).
The next issue is the p-hacking topic (relabeled "Z-hacking" :). This is really interesting! You created a scenario in which a researcher has measured 20 demographic variables which are spurious (i.e., uncorrelated with X and Y in the population) and create correlations in the sample only by chance. My response has two parts--again, one focusing on the causal aspect, the other on the statistical aspect (false positives, alpha error). Spoiler: my conclusion will be that this scenario is not unlikely, but not a great enough danger to count as an example against control variables.
Let's start with the statistical issue. Initially I completely agreed that something like that happens all the time and is likely. However, I was curious to know the exact probability and did a small Monte-Carlo simulation. I attached a small text file with the R code so that you can inspect whether I did something wrong.
In the simulation, I created three uncorrelated variables (X, Y, Z), then drew 2000 samples, in each estimating a simple Y~X regression followed by a Y~X+Z regression, reflecting the procedure of Z-hacking (i.e., our researcher starts with an X->Y effect and then includes Z as a control variable). The question is: how often does a scenario occur in which the X effect in the simple regression is *non-significant* and then becomes significant when a random control variable is included? The results show that this was the case in only 7 of the 2000 cases, which is a probability of p = .004 (whereas each singular correlation had the expected alpha error rate)! As all 20 random control variables have this probability, the chance that one of them creates a spurious X->Y effect is 20*.004 = .08.
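(The attached file is not reproduced in this thread, so the following is a minimal reconstruction of the procedure just described; the per-sample size and the seed are my own assumptions, and exact counts will therefore differ.)
-------------
# Z-hacking sketch: X, Y, Z mutually uncorrelated in the population
set.seed(123)
n_samples <- 2000
n_obs     <- 100   # assumed number of observations per sample

flips <- replicate(n_samples, {
  x <- rnorm(n_obs); y <- rnorm(n_obs); z <- rnorm(n_obs)
  p_simple  <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  p_control <- summary(lm(y ~ x + z))$coefficients["x", "Pr(>|t|)"]
  p_simple >= .05 && p_control < .05   # X significant only once Z is included
})

mean(flips)   # proportion of "successful" Z-hacks (a few per thousand)
-------------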
Granted, this is higher than 5%, but on the other hand not so dangerous as to serve as a defense for not including controls (especially when a clearer rationale is demanded). Besides, this is an extreme scenario in which all controls are uncorrelated (which most often will not be the case). In the majority of cases, X, Y, and Z will be correlated, and therefore controlling for Z will do "something". However, I would expect that in the vast majority of cases, controlling for Z will *reduce* the effect of X, making it non-significant rather than significant.
Actually, there is only one scenario in which a non-existing X-->Y effect suddenly becomes (spuriously) significant when controlling for Z--that is, when Z is a common outcome of both (a collider). All other forms (mediation/confounding) will lead to some portion of the correlation being eliminated.
Besides the issue of false positives, a more fundamental question is: would we have to expect a *bias* (a long-term/large-scale incorrect effect)? This also hinges on the actual causal role of the control variable: IF Z is a confounder, then controlling for Z will *not bias* the effect.
As mentioned above, the only scenarios in which controlling will constantly mess things up are when Z is endogenous (i.e., a mediator or collider). THIS will bias the effect. With regard to your 20 demographic variables, it is highly unlikely that X (e.g., ice cream consumption) changes your gender, race, or social status. Hence, controlling for demographic variables will never hurt in the long run.
Now to the post-hoc pseudo-rationalization by our researcher and the importance of a real, clearly stated identification rationale. You wrote:
-------------
"All I need now is a seemingly plausible reason for controlling for gender. I do a little online research of male/female differences, and I discover a study that suggests that men are more reckless than women. In presenting this post hoc explanation in the paper, I say something like, "Prior research has demonstrated that men are more reckless than women. Therefore, because swimming after eating ice cream may be a reckless act, we controlled for gender."
"Sounds plausible, right? But it is utterly dishonest and disguises the desire to include gender simply because it produces p < .05 for the partial coefficient for ice cream consumption."
-------------
First (again and again), finding/assuming a *mere correlation* between Z and X is no basis for defending Z as a control variable, as this correlation may stem from Z being a cause, mediator, or collider. Hence, the causal role of Z is critical. With regard to gender, this would be unproblematic (as ice cream does not change your gender).
Second, your phrase "because swimming after eating ice cream may be a reckless act" is completely vague. The phrase seems to be some kind of evaluation of swimming-after-eating-ice-cream rather than a causal claim. In addition, the phrase refers to only some people in the population (those who do that). And finally, describing an act as reckless is not the same as recklessness as an attribute of an individual (on a state or even trait level), which a causal claim would require.
A proper defense of including gender (due to differences in recklessness) would be to argue that recklessness causes people to eat more ice cream AND leads to more deaths (e.g., by careless driving, bungee jumping, and...drowning).
THIS causal setup would create a spurious relationship between ice cream and death. While the death effect is surely highly plausible, I doubt that recklessness causes ice cream consumption. Hence: no sir, I as a reviewer would not buy that. But even if I did, it is even less plausible that eating ice cream makes you more reckless (leading to a mediational structure in which the ice-cream-death effect is eliminated by controlling for recklessness). Hence, again: no harm in controlling for that in terms of bias.
In the following posts, you stepped back from the idea of just reporting correlations (that calms me :), but then you become inconsistent when you write that if one has a more complicated model, an SEM would be appropriate. An SEM, however, requires even more a set of causal assumptions, and if these assumptions are wrong, the estimated effects may be biased up to complete bogus. And in an SEM, you have multiple controlling processes. Thomas, you cannot have it both ways--discarding control variables but favoring multivariate analytical methods.
In the final post (directed at David), you wrote that effects are not necessarily biased when confounded. Yes, they are: simple path tracing reveals that an estimated effect is the result of all open paths linking X to Y, and one open path runs via the confounder. As a result, this open non-causal path adds to the estimated effect (depending on the setup of all involved correlations, resulting in an upward or downward bias--as assumed before, most often an upward bias).
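A tiny numerical check of this path-tracing logic (standardized variables, invented coefficients): the simple-regression slope should equal the direct effect plus the product of the coefficients along the open confounding path.
-------------
# Path tracing: simple slope = direct effect + confounding-path product
set.seed(1)
n <- 1e6
z <- rnorm(n)                               # confounder
x <- 0.5 * z + sqrt(1 - 0.5^2) * rnorm(n)   # keep x at unit variance
y <- 0.3 * x + 0.4 * z + rnorm(n)           # direct effect of x is 0.3

round(coef(lm(y ~ x))["x"], 2)   # ~ 0.50 = 0.3 + 0.5 * 0.4 (upward bias)
-------------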
Then you write some absolutely correct things (i.e., that controls are often vague).
I propose to decrease the vagueness!
In other fields there is a tremendous development happening and we as organizational researchers are missing the train.
Fortunately, there are some shining exceptions, e.g., as the just-published paper here:
Wysocki, A. C., Lawson, K. M., & Rhemtulla, M. (in press). Statistical control requires causal justification. Advances in Methods and Practices in Psychological Science. doi:10.1177/25152459221095823
https://journals.sagepub.com/doi/10.1177/25152459221095823
All the best,
Holger
Mis-specification will occur anytime an omitted variable is correlated with both an IV and a DV. It does not require moderation (or mediation, for that matter).
Thanks, David and Holger, for your most recent comments. I'm now stepping out of my role as the "champion" of deleting control variables. Just a few final points from me:
1. I'm convinced from the discussion that there may be situations where statistical control is justified. I remain wary, however, of the use of CVs and how they are chosen, used, and reported in many articles.
2. It seems to me that, in most instances, if researchers are going to take the time and effort (as they should) to explain and model the role of Z (a possible CV) in a study, they might as well treat Z like any other variable of interest rather than as a CV.
3. Although our discussion didn't get into all the details, I believe a fair amount of statistical legerdemain occurs in science (I'm happy to provide references). Clever manipulation of so-called control variables, as in the case of pseudo-suppression, is one form of "questionable research practices." In my view, much of this behavior occurs due to dysfunctional incentives to publish, get grants, get tenured, and get promoted.
4. Although misspecification occurs if a study omits a relevant variable, serious mistakes can be made by mis-modeling variables, including CVs, that are included. Also, note that it is often the case that one or more potentially relevant variables are not included in a given study. For instance, there are only so many measures in a survey that can be included if one wants a decent response rate. My point: We should be at least as cautious about how we model CVs (if we include them) as we are about misspecification issues.
That's it from me, my friends. I wish you the very best of luck with your research, and in all things. (Of course, feel free to carry on the conversation. I have simply run out of things to say.)
Warm wishes, Tom
Hi Thomas,
thanks for the nice words and the summary, with which I agree 95%. I especially concur with your being wary, and I encounter the dubious choice of CVs every day. I still think that Z-hacking (finding spurious effects after controlling for Z) will happen only when Z is a collider, but I have only started to think about this phenomenon (thanks to you).
David L Morgan One of my lifetime goals is to convince you to think about the necessity of controlling for Z not in terms of correlation but in terms of confounding. And yes, accidentally controlling for a mediator happens every day and is responsible for the death of thousands of effects of valid predictors :)
All the best,
Holger
What difference would it make if I wrote my replies in terms of partial regression coefficients instead of partial correlations?
Hello David,
my response referred to your saying that ...
"Mis-specification will occur anytime an omitted variable is correlated with both an IV and a DV. It does not require moderation (or mediation, for that matter)."
This is an insufficient basis for the inclusion of control variables (and for the notion of misspecification), as mediators and colliders also "correlate" with X and Y.
If the true data-generating process runs via X-->M-->Y and you just perceive X and M as two predictors/causes of Y (and fail to consider the X-->M part), then your regression or partial correlation coefficient will only reveal the direct effect of X (as the path running via M is eliminated/blocked). In the case of a full mediation structure, you get nothing (as there is no direct effect) and you will draw the wrong conclusion that X has no effect. This is called overcontrol bias.
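A minimal sketch of this overcontrol case (full mediation; all numbers invented):
-------------
# Overcontrol bias: X -> M -> Y with no direct X -> Y effect
set.seed(1)
n <- 1e4
x <- rnorm(n)
m <- 0.6 * x + rnorm(n)   # mediator
y <- 0.6 * m + rnorm(n)   # X affects Y only through M

round(coef(lm(y ~ x))["x"], 2)       # ~ 0.36, the true total effect (0.6 * 0.6)
round(coef(lm(y ~ x + m))["x"], 2)   # ~ 0.00: controlling the mediator kills it
-------------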
The same goes for accidentally controlling for colliders (common outcomes) or the simpler version of controlling for an outcome of Y (which is the same problem and represents the core problem of selection bias). A nice and popular example of this is the birthweight paradox (showing that low-weight babies have a higher survival chance when their mothers smoke), or the famous suppressor-variable problem.
In case you missed it, this will be an eye-opener:
Wysocki, A. C., Lawson, K. M., & Rhemtulla, M. (in press). Statistical control requires causal justification. Advances in Methods and Practices in Psychological Science. doi:10.1177/25152459221095823
https://journals.sagepub.com/doi/10.1177/25152459221095823
All the best,
Holger
Okay, I have to jump back in to underscore a point I made earlier. If one follows the procedure suggested by Wysocki, why even call a given variable a "control variable"? Why not just call it another variable included in the study?
Hi Thomas,
it does not matter what you call it. The main issue is that for ANY effect of any X on ANY Y, you have to think about potential confounders (or, to be more general, variables lying on a confounding path, which refers to Fig. 3B/C in Wysocki et al.). Hence, a variable Z having a "controlling function" for a pair X-->Y can at the same time be an interesting variable on its own merits (and perhaps the X variable now has the controlling function for the Z-->Y effect).
It's all about--as Felix Elwert called it--"to strip an observed association of all its spurious components" (Elwert, 2013, p. 247).
Furthermore, in an SEM, for instance, there are several supposed cause-effect pairs, and a set of control variables may be the same for all pairs or pair-specific.
Yes, all of this is hard and unlikely to be 100% successful (which is often presented as a counterargument). I see it as at least achieving a theory-based choice of controls instead of mere custom or the ubiquitous "because it correlates with..." phrase. With every reasonably identified control variable, you reduce the bias a bit more. This is at least something. IMHO.
All the best,
Holger
Elwert, F. (2013). Graphical causal models. In S. L. Morgan (Ed.), Handbook of causal analysis for social research. (pp. 245-273). Springer. https://doi.org/10.1007/978-94-007-6094-3
Okay, Holger! So if it doesn't matter what we call variable Z, why don't we delete the idea of control variables? This would avoid confusion and require researchers to specify the nature of the relationships (causal or otherwise) between Z and the other variables in a model.
Traditional SEM analyses have done exactly that for decades and just incorporated masses of "interesting" variables without any identification strategy for the respective paths (besides the likewise disastrous issue of ignoring model misfit). I did all of that, too.
Nowadays I am much more modest and would focus on simpler models with 3-4 target effects but probably a bunch of "auxiliary variables" whose sole purpose is to identify these effects. This is rather the philosophy practiced in econometrics, which is much more focused on nailing down a single effect. In this regard, thinking about instrumental variables can be fruitful (if it works).
By the way, I do not like the term "control variables" because it represents one more example of mixing causal with statistical issues (you don't "control" anything when doing a regression analysis). Consequently, the causal literature rather uses "conditioning" or "adjustment". See Morgan & Winship:
Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research (Analytical Methods for Social Research). Cambridge University Press.
Holger
Tom, I agree. A "control variable" should have the same status as all the other variables in a study. A control variable is an assumed causal variable that can explain the relationship between the two (or more) focal variables. If you are testing a model, it should be part of that model and not just an add-on that in many cases doesn't show up in the figure.
Coincidentally, the other day I came across the Wysocki paper Holger Steinmetz mentioned above and it inspired this week's blog: https://paulspector.com/statistical-control-is-academic-rigor-theater/
Hello Paul,
nice to have you here.
I partly agree with the recommendation that a control variable should be part of the model and should not be missing from the path diagram (especially with regard to its expected set of effects) or from the causally-inspired identification strategy underlying this set. At the same time, controls have only a functional value for identifying the effects of the target variable--hence, if you want to interpret the effects of the control variables, then you would need an identification strategy for them too, as Paul Huenermund writes nicely in his blog.
https://p-hunermund.com/tag/control-variables/
Of course, you can err with regard to the causal role of the controls as well as with regard to their completeness (you will almost always fail to be complete), but at least you have a precise, transparent, and assessable set of assumptions instead of including them simply by the customs of the field or vague statements about the controls "being correlated with X and Y".
That's essentially the goal of the whole thread (and I include Thomas' efforts here, too), namely to change the theater into reasonableness.
All the best,
Holger
I like this idea very much; your paper with Dieter Zapf and the others (Spector, P. E., Zapf, D., Chen, P. Y., & Frese, M. (2000). Why negative affectivity should not be controlled in job stress research: Don't throw out the baby with the bath water. Journal of Organizational Behavior, 21, 79-95.) is so well written, so well developed. For instance, age: with an aging workforce, we know more than ever that we should not "control" for age but look at it to understand its effects. Or gender, another "control" variable that needs more attention. Or emotional stability, often excluded from the "explained" variance--but why? Thank you, Paul, for this initiative!
Hello Andrea Fischbach,
I hope you don't mind that I respond even though you addressed Paul.
First, I would advise reading the thread. It is a bit frustrating to have all these issues discussed forwards and backwards and then have people start all over again.
Second, I have trouble understanding how avoiding control of a confounding variable would enhance your understanding. Take our ice cream example, which provided the main example throughout the thread (the scenario was that there is a relationship between ice cream sales and the number of deaths by drowning). How would ignoring "season"--the main creator of this spurious relationship (as it is a common cause)--help you understand a) the relationship between ice cream and drowning and b) the role of season?
From my point of view, it is the opposite. As argued several times, the explicit generation of a "causal identification strategy" underlying the control of season (or age, or whatever) rests on the assumed causal role of the control variable. Explicitly considering the control variable, according to this perspective, will inform you about the role of ice cream for drowning (namely, there is none) and the role of the season (namely, it probably affects ice cream sales and death by drowning).
If, however, your belief rests on the procedure of "blindly throwing variables into the regression", then I totally agree with your perception. We should stop that.
BTW: here is another recently published paper that shows that perhaps, slowly, psychology will learn :)
Grosz, M. P., Rohrer, J. M., & Thoemmes, F. (2020). The taboo against explicit causal inference in nonexperimental psychology. Perspectives on Psychological Science, 15(5), 1243-1255.
All the best
Holger
Andrea and Paul, thanks for your comments! Holger, I found your tone condescending at times, but the content of your messages was great (as were the references you provided). Thanks to all that have participated in this thread.
Take a simple regression model with Y = a + bX + cZ. The distinction between the treatment variable X and the control variable Z is conceptual, not formal, and it depends entirely on your research question. So, why should we distinguish between the two variables? If I am examining the causal effect of X on Y, and the variable Z is the only confounder, then after I control for Z in my model, the coefficient b is the average causal effect of X on Y. Does c then indicate the average causal effect of Z on Y? The answer is no: c does not have a causal interpretation but only reflects a trend in Y as Z changes. If we want c to have a causal interpretation, then we need to further control for confounding between Z and Y. In other words, controlling for Z removes the non-causal correlation between X and Y due to confounding, making the coefficient b causally interpretable. Thus, although we cannot tell from the form of the equation which is the treatment variable and which is the control variable, we do and should know which is which based on our research questions and hypotheses. This is my personal view, for your reference.
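A small simulation along these lines (all structural coefficients are invented; U is an unobserved confounder of Z and Y) illustrates the point: b recovers the true effect of X, while c drifts away from the true effect of Z.
-------------
# "b is causal, c is not": Z confounds X -> Y, but Z has its own confounder U
set.seed(1)
n <- 1e5
u <- rnorm(n)                                 # unadjusted confounder of Z and Y
z <- 0.5 * u + rnorm(n)
x <- 0.5 * z + rnorm(n)
y <- 0.4 * x + 0.3 * z + 0.5 * u + rnorm(n)   # true effects: X 0.4, Z 0.3

round(coef(lm(y ~ x + z))[c("x", "z")], 2)
# "x" ~ 0.40 (identified); "z" ~ 0.50 (biased away from its true 0.3)
-------------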
Fan Chao, thank you for sharing your thoughts. My friendly suggestion is to read Paul Spector's work explaining how control variables cannot be relied upon to "purify" a relationship - that is, to reveal the true nature of the relationship. Specifying research questions or hypotheses does not give control variables this ability.
Daokui Jiang, can you say a bit more about your position? What idea is false, and why?
I only became aware of this discussion today when Kelvyn Jones pointed to it in another thread (see link below).
https://www.researchgate.net/post/What_is_the_difference_between_having_a_true_control_group_and_controlling_statistically_without_having_a_true_control_group
The current thread is now quite lengthy, and although I have skimmed through all of the posts, it is possible that I missed something. So apologies if this point has already been mentioned.
Andrew Gelman argues that we should say that we are adjusting for other variables rather than controlling for them. You can read his comments here:
https://statmodeling.stat.columbia.edu/2019/01/25/regression-matching-weighting-whatever-dont-say-control-say-adjust/
EDIT - It is now 4 days since my initial post. I have now read the Wysocki et al. article mentioned earlier in the thread, and notice that in their first footnote, they acknowledged Gelman's concerns. Here is what they wrote:
1. Note that many statisticians prefer the term “adjust” over “control” in the context of regression. The concern is that statistical control may be mistakenly conflated with (the stronger) experimental control (e.g., Gelman, 2019). Although we agree that “adjust” is more precise, we continue to use the term “control” because it is more prevalent in applied psychological research.