In regression and correlation analysis we often find a statistically significant relationship between two variables. Before starting the research we also identify possible relations between the variables from the previous literature. The common cliché "Correlation doesn't imply causation" negates the validity of all quantitative analysis based on statistical methods. How can a quantitative researcher prove the causality behind statistically significant correlation and regression results?
I totally agree with Ehsan. But since this topic is so often misunderstood, I will answer the same with different words:
Proofs are a logical/mathematical thing, and they are in principle never possible based on any statistical analysis of empirical data. This is one major misconception (that data "proves" something). Therefore, no p-value in the world is a "proof" of anything. Recall that the distribution of p-values under H0 is uniform. Data helps us to change our beliefs in models. Strong empirical evidence can convince us that the relationship between two variables is useful (e.g. to predict other things, to "understand" something more).
There are two things that play a role in judging "causation": reasonability and experimental design, where the latter is the stronger.
If you have an observational study, you can only argue with common sense that the observed correlation might indicate a causal relationship. You will need some argumentative support: is it reasonable? Would it fit with other things we already know? Is there a possible/reasonable physical explanation? And so on. Nevertheless, it is really hard to convince someone that there is a causal relationship just from the data of an observational study, since there are infinitely many unconsidered things that may be overlooked and still play a role, possibly leading to a co-variation of the observed variables even with no causal relationship.
The golden way to test a causal relation is an appropriately designed experiment, where only the suspected "cause" is varied and the "effect" is observed. If nothing else was systematically different and the "effect" comes after the "cause" (in time), then a tight correlation is a strong argument that there is a cause-effect relation between the variables.
So the whole topic has nothing to do with statistical data analysis. Just with common sense.
I think you can find useful information here:
http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
and for the Granger causality here:
http://en.wikipedia.org/wiki/Granger_causality
The truth is that there are not many ways to establish causality statistically, and I think it is a very interesting research field.
In a study, the only difference between correlation and regression is a LOGIC OF DEPENDENCE. It means that the correlation between two variables doesn't imply dependence between them, but logic does. For instance, there is a correlation between atmospheric CO2 and photosynthesis. From the viewpoint of biochemistry (here chemistry is the logic) there is a dependence of photosynthesis on the CO2 concentration. This is an example of regression, and the R2 may be very high or low depending on other environmental parameters.
A simple example of mere correlation is the amount of time that girls and boys stay in class at a school. It is obvious that in a class both boys and girls stay until the end of the lesson, so there is a very, very high correlation between them (the more the girls stay, the more the boys do). But the time that girls stay in class does not depend on the boys; it depends on the lecturer or teacher.
So, suppose we do not know that. To find out what is happening, we do not permit the boys to go to school for some days, but we see that the girls still go. So there is no dependence between them; rather, both depend on the time the teacher stays in the class.
Another way, which is STATISTICAL, for this example is to tell the boys (or girls) to leave the class at exactly, for instance, 1:30 after the beginning. We then see the other group stay longer or leave sooner, so the correlation between them drops significantly.
In summary, to answer this question we should take one of the variables, which we think can be the regressor, under control (and not make any change in the other variables). If the correlation does not change, or changes only slightly, we can be fairly confident there is a dependence between them. But if the correlation changes significantly* we can assume there is no regression relationship.
Although correlation does not imply regression, we can build a regression model between the variables with caution.
* It should be noted that if the correlation changes significantly and the coefficient is still statistically significant, we should make the changes in a way that brings it below the significance level, or we may doubt the regression/correlation relation.
Hope it helps
Dear Abu, I made some changes to my answer, please read it again to get more answers.
Abu,
This is indeed an active area of investigation. My hunch is :-)
1. Both the independent and dependent variables need to be measurable quantities that belong to the same system, e.g. sunspots and convection current strength/flow in the Sun may have both a correlation and a causal relation. The dependent and independent variables will be connected by an equation/law that fits the data perfectly using integer degrees with a nonsignificant constant.
2. The variables may be part of seemingly two different systems but are connected by one direct interaction (charged particles) that links them, leading to partial and indirect causation. This will show up as a medium-high correlation, but the non-linear equation would have additional terms and a larger constant, e.g. frequency of communication satellite breakdowns and sun activity. Some satellites will be more robust; some may have orbits that affect their chances of radiation exposure.
3. Where there's a chance correlation, like Columbia River salmon runs and the 11-year sunspot cycle: there's no single direct interaction at the given moment that will affect salmon. Sun activity, the magnetic fields thus affected, etc. could be one clue, but among many.
"Correlation doesn't imply causation" ? or "Correlation may not imply causation" ?
Dear Subrata, it doesn't by itself. But we can use it in some way to support causation: combining correlation with logic and control, we can do that.
I agree with Jochen Wilhelm, the best way to test a causal relation is to design an appropriate experiment. However, in many situations (for example when we are willing to test such relations on nationally representative surveys) only non-experimental data are available. In these cases we can use some specific techniques, such as instrumental variables or propensity scores, that allow us to test (at least approximately) for causal relations in a non-experimental environment.
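To make this concrete, a minimal R sketch (assuming a hypothetical data frame d with outcome y, binary treatment t, covariates x1 and x2, and an instrument z; MatchIt and AER are just two common packages for these techniques, not necessarily the ones meant above):
library(MatchIt)  # propensity-score matching
library(AER)      # ivreg() for instrumental-variable (2SLS) estimation
# d, y, t, x1, x2 and z are hypothetical placeholders
m  <- matchit(t ~ x1 + x2, data = d, method = "nearest")   # match on the estimated propensity score
md <- match.data(m)
summary(lm(y ~ t, data = md))        # treatment effect estimated on the matched sample
summary(ivreg(y ~ t | z, data = d))  # IV estimate; valid only if z affects y solely through t
Of course, both estimators rest on untestable assumptions (no unmeasured confounding given the covariates, validity of the instrument), so they strengthen rather than prove a causal claim.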
Dear Emiliano, you've noticed an important point. Here I should say that the aims of regression models divide into two important things: prediction and control. With the aim of prediction, any correlated variable can be used regardless of dependence. For the data you've mentioned, we can only do predictions, not control. But if we want control, there should exist a dependency, such as increasing the yield of a crop using different amounts of manure, irrigation, etc.
@Ehsan Khedive
Yes. We do not compute and test the significance of the correlation between any two variables. It is crucial to have prior knowledge about their dependency/causation before we decide to go ahead with correlation.
R.A. Fisher, when discussing the use of observational data to demonstrate causality, said to "Make your theories elaborate", meaning that if a process is causal, all its logical implications must simultaneously be true.
This was the basis for W.G. Cochran's (1965) work to link cigarette smoking to lung cancer using observational data. This topic is treated well by Rosenbaum (Observational Studies) as well as Gelman and Hill (Data Analysis Using Regression and Multilevel/Hierarchical Models).
I disagree that use of adjustment techniques like propensity scores, or any similar ANCOVA like approach is adequate without looking at multiple competing hypotheses.
To see this, consider a situation where all measured ancillary covariates are exactly equal across treatment groups. In this case there is no adjustment necessary and the analysis is a simple ANOVA or Regression, which as we know still only supports association, not causation, in the absence of a designed experiment with randomized assignment of treatments and random selection of subjects.
The reason for this is that without randomization there may be one or more lurking unmeasured variables that are causative. This is why Fisher advocated the elaborate theory approach. While adjusting for measured covariates is critical to minimizing bias and spurious results, these approaches in and of themselves do not lead to a conclusion of causation.
Failure to recognize this limitation is the reason why Red Wine is good for us one week and bad for us the next!
I shall not agree that the situation is as easy as it was described by Jochen and Ehsan, and I shall rather agree with some aspects of John's thought. When we increase an index like R^2 or F, it does not necessarily imply that a causal effect is present, since we cannot easily distinguish the arrow of interdependence. In order to decide on causality we have to be absolutely sure about the arrow: x-->y, x causes y and NOT vice versa. That's the reason why I still insist that we have an unresolved and interesting problem here...
Dear Demetris, I think it can be resolved. As you said, Y depends on X and not vice versa. So, if we get control of X and change it as we want, the correlation coefficient must not change significantly (though it will vary as a random variable). But think of getting control of Y. By changing Y, the correlation coefficient should get lower and lower, because X does not depend on Y.
Therefore, we find out two things by using statistics in a logical way. First, we can explore whether there is a dependency relation, and second, which one is the regressor and which one is not. Although some exceptions may exist that I do not have in mind right now.
Hope it clears the cloud.
One important thing I should tell all friends is: we cannot examine the existence of dependence between two variables by applying only one statistical analysis to just one dataset. Instead, we should:
1- Take control of both variables (say we have just X and Y)
2- Test the correlation and regression of X on Y and vice versa
3- At least apply the procedure (controlling the regressor, observing the dependent) 2 or 3 times.
Any more suggestions would be appreciated.
Hello,
First, we must develop a rational theory and hypothesis. Second, we must design an experiment or field survey, and collect data. Third, regression and correlation only give us supporting evidence for our hypothesis or theory. This order of steps should never be reversed.
From observations with errors, it is not possible to reach a proof, which could be defined as a 100% confident conclusion. Observations will reach a conclusion to a specific percentage confidence level. A high confidence may be reached, but any day, someone may devise a situation in which the proposed mechanism fails.
The best two plots I have seen to show this were time series over many years of the birth rate in Norway and the stork population. A coincidental correlation resulted from a greater birth rate yielding more chimneys in which storks would roost. A second example was a time series of sunspot activity and Republicans in Congress over several sunspot cycles. The correlation was very high, yet we would all be hard pressed to seriously postulate a mechanical linkage. Another good example is that almost every environmental variable (such as temperature in Australia) has an annual cycle, and so does the Dow Jones Industrial Average. Let's assume for the moment that a significant correlation exists between the two. Does the Dow Jones Industrial Average drive the temperature in Australia? Again, we would be hard pressed to pose a realistic mechanism by which this would occur.
All correlation may show is that observations are consistent or inconsistent with a proposed theory to a quantitative confidence level.
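A tiny simulation (purely hypothetical numbers) illustrates the shared-annual-cycle point in R:
set.seed(1)
month <- 1:120                                                             # ten years of monthly data
temperature <- 20 + 10 * sin(2 * pi * month / 12) + rnorm(120, sd = 2)     # made-up seasonal temperature
stock_index <- 100 + 50 * sin(2 * pi * month / 12) + rnorm(120, sd = 10)   # unrelated series with the same cycle
cor(temperature, stock_index)                                              # large correlation from the shared cycle alone
Neither series has anything to do with the other; the correlation is produced entirely by the common seasonal component.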
1. The statement "Correlation doesn't imply causation" is of lower importance for a researcher than "There is no causation without correlation". That's why everybody in hard sciences use statistics to prove something (exiting statistic analysis was used by Higgs boson team.) . In that sense there is no alternative to correlation to reveal potential causation.
2. The difference between causation and correlation is that the latter may fail when new data are obtained from lomger or more accurate observations. One can never say, however, that data is enough. In physics, the gap between actual and 100% correlation is the frontier of research. Here we mention photoeffect, dark matter, etc.
3. At the end of the day all that physics knows about this universe is a number of correlations between measured values. Every researcher wants to find the case when these correlations fail. Only such failures provide real breakthroughs. So, keep going with correlations and forget "Correlation doesn't imply causation".
@Ehsan, take the attached data and tell me rigorously:
x-->y or y-->x
and why?
Dear Demetris, I've said before that to find the relation we must take control of the variables. I mean, control one variable and gather data again. For example, say we do not know that there exists any dependency of photosynthesis on sunlight, but we have a dataset which shows a high correlation. Now read precisely what I say:
Plan an experiment and gather a new dataset (the things I've said in my last comment). Control* the amount of sunlight which reaches the plant leaf and simultaneously measure both sunlight and the photosynthesis rate. You will see that the same correlation exists again and has not changed significantly. Now you can be sure that photosynthesis depends on sunlight.
* My idea of control is to allow any amount of sunlight you want to reach the leaf, by shading or filtering the light.
I hope it explains my thoughts.
Dear Ehsan, unfortunately most of the time we simply cannot repeat or alter the experiment: it is historical data, so we can only use our statistical tools. If we suppose that my data is from historical observations, then how can you proceed?
I would agree with Jochen Wilhelm that the golden way to test a causal relation is an appropriate experimental setup. However, when you deal with health or environment, an 'experiment' which varies just one variable, is usually impossible. In the worst case you are just given a heap of data. In this situation the best approximation to "testing the dependence of Y on X, while all other conditions are the same" would be to consider all factors available and to test whether a correlation X ~ Y persists, when conditioned on the other factors. If so, the belief in a causal relation is strengthened. However, there could be always hidden factors not yet investigated - that could even lead to a reversal of the correlation as in instances of Simpson's paradox. Thus a causal hypothesis can only be honestly maintained if you keep looking for factors hitherto unnoticed. (A certain amount of such statistical evidence for a causal hypothesis would be a good reason to look for a theory, a mechanism of how this causation works. Without a plausible mechanism I would never be too sure.)
The question arises whether this procedure can be formalized beyond "common sense" - a question of the philosophy of science and of computer science at the same time (the latter with the perspective of 'automatically' extracting the causal models most appropriate to a given data set; not of "proving" a causal model). I am not very familiar with the literature about this, but I remember models proposed by Judea Pearl http://bayes.cs.ucla.edu/BOOK-2K/index.html.
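A small simulated example (hypothetical data) of the Simpson's-paradox point, where conditioning on a hidden factor reverses the sign of the association:
set.seed(2)
g <- rep(c(0, 1), each = 100)        # hidden grouping factor
x <- 2 * g + rnorm(200)
y <- 3 * g - 0.5 * x + rnorm(200)    # within each group, y decreases with x
cor(x, y)                            # positive overall correlation, driven by g
coef(lm(y ~ x))                      # positive slope when g is ignored
coef(lm(y ~ x + g))                  # negative slope for x once g is conditioned on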
Dear Demetris, for such data we have two ways to find dependency: 1- logically, as I said before, or 2- statistically, by analyzing other data around the world (in this method an expert statistician is needed). For instance, if the data belong to a country, gather data from some other countries (this method takes time to explain; you can consult a local statistician who is expert in this method).
Abu, for non-Gaussian data there are methods to test for causation, e.g., structural equation modelling (SEM) and estimation of linear Bayesian networks. For acyclic models, see papers by Shimizu and others. For Gaussian data, you will need nonlinear networks. You could also discriminate between x->y and y->x by using a likelihood ratio test. See the paper by Dodge and Rousson. There are also three-way extensions to this.
The examples Jacobs refers to are illustrations of Yule's nonsense correlation (there is also a recent funny one correlating chocolate consumption with the chance of winning a Nobel Prize!). These are often due to a latent (hidden) factor behind both variates. Once you know that factor, you can easily test for it using partial correlation coefficients; the difficulty usually is trying to find that hidden factor.
Hope this helps.
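As an illustration of the partial-correlation point above, a short R sketch with a hypothetical latent factor z driving both variates:
set.seed(3)
z <- rnorm(500)              # hidden common factor
x <- z + rnorm(500)
y <- z + rnorm(500)
cor(x, y)                    # sizeable "nonsense" correlation between x and y
rx <- resid(lm(x ~ z))
ry <- resid(lm(y ~ z))
cor(rx, ry)                  # partial correlation of x and y given z, close to zero
The hard part, as noted above, is identifying z in the first place.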
Just to add that one possible vehicle for realizing SEM is independent component analysis, among others.
I agree with all the previous comments, the question is very complex. And in my opinion it does not have a unique answer. Indeed, I think that the solution to adopt also depends on the field of study and on the available data. For example, in the policy evaluation field propensity scores and instrumental variables are widely used because the analysis is often performed on secondary non-experimental data, while in other fields such as mechanical engineering designed experiments are much more frequent.
So, in my opinion, the best way to approach the causality problem is to rely on field-specific theoretical frameworks and field-specific statistical techniques.
Of course, in any non-experimental setting one can only find strong evidence of a causal relation and consistency with previous findings and theory, but not a definitive proof of causality.
I don't think that the question is complex. It is simple. Causality is an epistemic concept and not a matter of data analysis or statistic figures. Things start to become very complicated when you start confusing "data" and "meaning".
Of course, nothing is to be learned from the confusion of epistemic and statistical concepts. However, it is legitimate to ask about the relation between both, a question which I do not find simple at all. (Of course, this question is different from the original question of "how to prove causality", but it is the next best question to ask after stating that causality cannot be proved by statistical means.)
Arguably, this type of question belongs to philosophy of science and definitely not to statistics. There is an extensive literature on how, given certain statistical evidence, one can exclude certain causal models, and prefer others, if one makes certain assumptions. These theories are not trying "to magically pull causal rabbits out of a statistical hat" (Scheines in http://mlg.eng.cam.ac.uk/zoubin/SALD/Intro-Causal.pdf), but rather to make explicit some assumptions underlying scientific work. A survey on causal modeling can be found in section 3 of http://plato.stanford.edu/entries/causation-probabilistic/. It should be mentioned that in this setting the notion of 'cause' is itself probabilistic, a "cause increases the probability of a certain effect A" (as opposed to the classical: "a cause is a necessary condition in a sufficient set of conditions for the outcome A").
@Jochen Wilhelm: ok it is a simple concept in theory, I can agree with that.
But empirical testing of theoretical causal hypotheses can be very complicated, especially in fields like the social sciences. And data and statistical models are a central issue. Analyzing appropriate data with an appropriate statistical model is the only way to obtain strong evidence of a causal relation in the many fields where it is not possible to design experiments.
I don't think I'm confusing "data" with "meaning", I just think that "meaning" has to be confirmed by "data". So, in my opinion, causality testing has a lot to do with data and statistical analysis.
Yes Stefan, I am with you. And because it has "definitely nothing to do with statistics" we should not pollute this thread by talking about causation (more than "it has nothing to do with statistics!"). Recall that the question was particularly about statistics, talked about "correlation" and "statistical significance", and the first topic given is "Statistics". In my eyes it is deleterious to make statistics seem so complicated and complex simply by the bad habit of discussing things that are not related to statistics and entangling them with statistics. The particular problem here seems to be the fact that statistics deals with uncertainties, whereas the belief in a causal relation is an "all-or-nothing" statement (either something is causal or not). The work of Christopher Hitchcock you cited tries to circumvent this, but it does not solve the epistemic problem; it just assimilates a word or phrase into a statistical framework. It may also be related to the misconception of hypothesis testing, where statistics are used to make definite decisions (yes/no wrt. significant/ns), and it is (falsely) believed that we could use this to indicate the correctness of the decisions.
PS: I also think the philosophy of causation is difficult and interesting, but I don't see what it should have to do with the empirical/natural sciences. I also find the concept of "cause and effect" attractive and useful, but it is actually not more than a crutch for things we cannot sensibly express otherwise.
I agree with Stefan Born, definitive proof of a causal relation cannot be given by statistics itself. However, some evidence of causality can be given by an appropriate statistical modelling together with a solid theoretical framework. And this is the best one can do from an empirical point of view.
But if you consider the question only from the theoretical point of view, then I agree that it has nothing to do with statistics and data. I just don't see the point in doing only that; theory has to be supported by empirical findings.
I would remind everyone that Clive Granger was awarded the Nobel Prize in Economics in 2003; see details:
http://www.nobelprize.org/nobel_prizes/economic-sciences/laureates/2003/granger-lecture.pdf
for the concept of causality he had introduced.
Another good link:
http://www.scholarpedia.org/article/Granger_causality
So, the topic is very interesting, I think.
Thanks Demetris!! About the scientific modelling of causality see also this contribution by James J. Heckman:
http://jenni.uchicago.edu/discussion/Heckman_SocMeth_v35_2005.pdf
Just for the history of the discussion: if we do the Granger causality test for the data I have uploaded we see that (R package vars):
VAR Estimation Results:
=======================
Estimated coefficients for equation x:
======================================
Call:
x = x.l1 + y.l1 + x.l2 + y.l2 + const
x.l1 y.l1 x.l2 y.l2 const
-0.06604502 -0.54796458 0.16535112 -0.64960910 1.37846457
Estimated coefficients for equation y:
======================================
Call:
y = x.l1 + y.l1 + x.l2 + y.l2 + const
x.l1 y.l1 x.l2 y.l2 const
-0.0400407726 -0.2956158987 -0.0005905655 -0.2151711230 2.6512206693
=========================
$Granger
Granger causality H0: x do not Granger-cause y
data: VAR object yy
F-Test = 0.0316, df1 = 2, df2 = 86, p-value = 0.9689
=========================
$Granger
Granger causality H0: y do not Granger-cause x
data: VAR object yy
F-Test = 1.4504, df1 = 2, df2 = 86, p-value = 0.2402
=========================
So, both times we accept H0 and conclude that neither x-->y nor y-->x. Of course, we accept 'Granger causality H0: x do not Granger-cause y' more strongly than 'Granger causality H0: y do not Granger-cause x', so we could assume that maybe y somehow Granger-causes x.
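For reference, the vars calls that produce output of this form look roughly like the following (a sketch, assuming the two series are columns x and y of a data frame dat; this is not necessarily the exact script used above):
library(vars)
yy <- VAR(dat, p = 2, type = "const")   # bivariate VAR with two lags and a constant
causality(yy, cause = "x")$Granger      # F-test of H0: x does not Granger-cause y
causality(yy, cause = "y")$Granger      # F-test of H0: y does not Granger-cause x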
I have not looked at all the other responses, but I noticed that Jochen Wilhelm's answer discusses time-varying data, where some values occur before others. Do a search on "Granger causality." You will find that you still cannot "prove" anything with data, but with logic you can derive hypotheses which your data analysis can support. Granger causality still does not prove causality, but it suggests possible directions of causality and unlikely directions of causality. This can be used to reduce the potential "other reasons or other causes."
Demetris, you have provided and analysed some data. Either I missed it or it was not done: planning of the experiment, given a defined effect size and a power. In this case you cannot interpret "non-significant" results in any way. You state that "So, both times we accept H0", which is a nonsense conclusion in this setting. It is just not possible to reject H0 while retaining the guarantee to keep a long-run type-I error rate, which does *not* mean that H0 is accepted. Further, it seems to me that you compared p-values in order to decide which causality equation may be preferred. However, given H0, both p-values would be observed/expected with the very same probability! Am I getting things wrong here?
Jochen, I present the results, and this does not mean that I agree with the process (it is not mine anyway). Since in both cases the p-value is large, we have to accept H0 in both cases. Only if we accept that an error of 24% is not 'big' can we say that we reject H0, but it is a large error anyway. It is an illustration of a given R package.
Yes, Demetris, I understood that it is just an example and not your own opinion. However, a large p-value is no indication of the absence of an effect. Rejection of H0 when p=0.24 would not mean that the error for this rejection is 24%. It just means that your long-term type-I error rate can be up to 24%. On the other hand, when you accept H0 because p=0.24 but the power is 2%, then you will have to live with a long-term type-II error rate of up to 98%. Is that sensible?
Fisher said a p-value is only interpretable as a measure of the evidence in the data *against*(!!) H0 (and never alone, only in conjunction with all other circumstances). Neyman said even this was wrong and a p-value doesn't tell you anything at all; only the long-run rejection rule guarantees that the long-run error rates of decisions won't exceed a given limit. (Actually both alpha and beta must be fixed a priori, because otherwise no alternative decisions can be made, but it is common practice not to fix beta, so only one of the decisions can be taken at all... Neyman said that "not taking a decision (to reject H0)" is actually also a decision, but with uncontrolled error rates, so this testing strategy is neither Fisher nor Neyman.)
Whatever you think, looking for causal relations does not make sense in the social sciences.
You can find causal relations only if you are working on experimental research, and it is not the case for the social sciences.
As a matter of fact, since the social sciences cannot control the intervening variables, it is impossible to speak of experimental design.
In experimental research, the researcher can choose the most important variables, decide which are the dependent and independent variables, and which variables should be considered constant.
Here, the most important is that the researcher is able to control all these variables. So, saying that some variables are held constant, means that variables do not actually take part in this experiment. It also means that the researcher can vary the value of the independent variables intentionally, knowing that the variation can be controlled.
So, can we do this as social researchers? My answer is no.
Obviously, we can do statistical analysis, but there is no statistical technique that can tell us whether there is a causal relation.
Stating that there is a causal relation among the variables is a scientific responsibility of the researcher, who must support the claims with data and discuss the findings fully, trying to convince everyone else.
I agree with R. Coats, one cannot prove causality using data analysis. Data analysis only allows us to find evidence of potential causality when testing theoretical hypotheses. So, appropriate statistical modelling is the way to support potential causal hypotheses with data.
I also agree that the concept of causality itself does not belong to statistical science.
I think that the two concepts (the epistemic and the statistical concepts of causality) have to be used together. I don't see the interest in developing a theoretical framework without checking the resultant hypothesis empirically, and I don't see the point in performing only data analysis. In my opinion these concepts are inseparable, at least in applied research.
I am still of the opinion that we can use statistics to prove a causal relationship. But the explanation is beyond the scope of this site. Besides, we cannot prove it with just one dataset. Further, the lower our control over the data, the more datasets from different situations we need to prove it.
My last word
@ Stefania Tusini:
Even though for social science there is no experimental design, it may still be relevant to come up with causal models for the findings whenever science is linked to policy making: if some X - e.g. certain properties of our school system - is convincingly modelled as a 'cause' of some Y - e.g. low social mobility - an appropriate policy of modifying X can be expected/hoped to change Y. Such a causal model would require, apart from statistical data, background knowledge about social mechanisms, which in some cases - e.g. when psychology is involved - might be tested in an experimental setup. Of course, there is great danger in wrong causal models for social phenomena. [In meritocratic states some members of the elites have a tendency to believe that the children of the poor are innately less intelligent and hence do not succeed in an otherwise fair educational system. This causal model will rather prevent changes to the educational system. I draw the example, somewhat simplified, from an article in a Chilean newspaper.]
Federica Russo analyses the possible uses of causal models in social science:
//blogs.kent.ac.uk/federica/files/2009/11/Explaining-causalmodelling_fullpaper.pdf
Personally, I prefer to analyse causality by interpreting the phenomena. I like two examples:
1. When it rains, the floor gets wet, but that doesn't mean that wetting the floor makes it rain.
2. Attach a piece of string to a toy car. Does the hand move the toy, or does the toy move the hand? Pull the string and it moves the toy, but push the string and nothing happens.
If you have the opportunity to do DOE, fine. It helps.
And don't forget spurious relations.
Proving anything is great. The hard part is getting the proof.
Jochen, I agree about the power etc., but given the aforementioned data what else could somebody do? Give an R example, if so.
You simply cannot conclude much from such data. The data neither provides (reasonably) enough evidence against "H0: y does not Granger-cause x" nor against "H0: x does not Granger-cause y". This is all. There is nothing an analysis could do, not in R or in any other software. Again, we are left with our arguments *outside* of any data and analysis. What do we know about the (logical, physical) relation between x and y? What can we assume or suspect about their relation, given everything else we know? Which model would be more attractive (fitting better with other models, being more useful in terms of making sensible predictions): x->y or y->x? Actually, does it matter at all to distinguish the "direction"? If I can adjust neither x nor y and can just observe a correlation, then why bother what causes what? And if I could adjust one of them, then why shouldn't I design an experiment that might then be able to tell me whether a systematic and planned intervention on, say, x leads to a systematic observed change in y?
I agree with Jochen: what software can do, a hand can do too; the difference is only the time.
In the study of cause and effect we have two important steps:
1- Proving whether there exists any causality, no matter what the direction (X-->Y or Y-->X).
2- Determining the direction.
We can achieve the first using statistics, in most cases. But for the second step we need control or logic.
Ehsan, your point 1 is not right. It would just modify the (wrong) statement "correlation implies causation" to the (still wrong) statement "correlation implies causation, but with unclear direction". Or did I get it wrong?
Dear Jochen, no, I obviously didn't mean that. I am pretty sure that correlation does not imply causation. My first point is a step after correlation: if we establish that there exists a correlation between two variables, in the next step we seek to determine whether it is based on causality or not, which is my first point.
I think my answer was a little confusing rather than wrong.
I see. In fact I got it wrong. However :) I think that the two steps then can't be separated. Causation is always directional. So when we - logically - assume a causation, this already requires that we have some idea about the direction, doesn't it?
Of course, yes. I consider the case where we do not know anything about the data and are just analyzing them. So if they were correlated in the different situations that we measured, they probably are related to each other.
Is my idea wrong?
So, Jochen, we end up again at the old question about the limitations of statistics: since we cannot conclude the arrow of causality even for N=50>30 data, our tools need improvement (there also exist other situations where the decision is unambiguous with a smaller number of observations).
@Ehsan, if I understand you correctly, you write that correlation implies relation. Sure it does. But this is not related to causation. Everything in this universe is related, and most things are related at least by the fact that they exist in the same space-time. Many relations are very, very indirect (look at the (cor-)relation between storks and newborns). In fact, the observed correlation just leaves open the possibility of a causal interpretation, but so does the absence of an observed correlation! We do need some logical grounds to talk about causation. Just and only looking at data (or correlations) by no means gets us towards any causal interpretation.
@Demetris, statistics/data analysis does not provide a framework to decide about causality. It is not that the tools are not good enough - there cannot be such tools in principle. It is like using a hammer and a screwdriver and a gripper to judge whether a person believes in god. You can improve these tools with electronics and hydraulics as much as you want, make them small and big and multifunctional and using cloud-computing power, and they still cannot do the task. (Granted, this example is not perfect, but I haven't found a better one for now; I hope it makes my point clear anyway.)
[Edit: typos]
@Jochen, in your last comment you summarized very well the whole discussion. I agree with what you wrote.
Causality hypotheses have to come from a robust theoretical framework, while statistical modelling is the way to check for relations (correlations) supporting our hypotheses of causality. Statistical modelling can give us strong evidence of potential causality, but cannot give a definitive proof of causality.
Traditionally, causal analysis has been undertaken from experimental perspectives involving direct and intentional manipulation of the independent variable. Such an approach greatly restricts obtaining explanations, especially in circumstances where such manipulations are technically unfeasible. New multivariate designs try to meet this difficulty by replacing the direct manipulation of independent variables with mathematical manipulations. Although this practice makes these applications very dependent on inferential statistics, specialized software and large population samples, it is an alternative view of causation beyond correlation analysis. While the assumptions behind these designs are still a matter of debate, they constitute valid options, at least in behavioral science research.
Correlation (covariance) is necessary, but not sufficient, to prove causation. An argument for a relationship being causal also requires that you show temporal priority (the "cause" must precede the "effect" in time), and requires that you show that the correlation is not "spurious", i.e., that the two variables in which you are interested are not covarying as the result of some other variable (measured or unmeasured). For example, if health status covaries with income, is this really a causal relationship where higher income produces better health? Or does poor health status result in lower income? Or do both health status and income increase with higher levels of education? Often we deal with data in which not all potential "confounding" variables have been measured, which is why it is more correct to say that one has "built a strong case for a causal relationship" rather than claiming "proof" of causation. (Experiments, where feasible and ethical, come closer to establishing "proof.") An excellent brief and clear discussion of the issues involved in establishing a case for causation in nonexperimental research appears in Chapter 2 of the book "Conducting Health Outcomes Research" by Kane & Radosevich (Jones and Bartlett, 2011).
Science is poorly designed to find the causes of effects. It is better suited to measuring the effects of causes. So we do not ask "does smoking cause cancer?" but rather "how much is your likelihood of cancer increased by smoking?" To answer this latter question we might design an experiment in which we randomly assign people to smoke or not and then look at the differences in the likelihood of cancer.
This is laid out carefully in:
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945-970.
and is well worth reading and internalizing.
I suggest you the book "Causal Analysis" by David R. Heise (freely available at http://www.sscnet.ucla.edu/soc/groups/mathsoc/publications.html)
@Eric Roth, @Lee Crandall: I strongly agree with your writings. New multivariate designs and longitudinal analysis are key issues in non-experimental settings, especially in economics and the social sciences, where experiments are hardly feasible.
By using existing theories; remember you are testing hypotheses which were formulated according to theories. Without theories, no causation. Or, alternatively, you may set up experiments :)
As far as I understand, there are three requirements to establish causality between two variables. They are direction, isolation and association.
Direction means that we should define which one is the causing variable and which one is the affected variable. Direction is established by means of theory or logic. Time ordering is often used as its indicator.
Isolation means that the change in the affected variable is only caused by the change in the causing variables. In other words, we should keep the other variables constant or unchanged. This requirement is the most difficult to obtain, particularly in observational studies. Isolation can be achieved by research design as well as by data analysis. In experimental studies, we can set all other variables to be fixed for all observation units. In observational studies, we can select observation units with homogeneous characteristics, which are very difficult to obtain or against the study objectives. If we cannot control one or several other variables which may influence the affected variable, we should include these variables in the data analysis.
Association means that the expected change in the affected variable due to the change in the causing variable is not zero. Association can be represented as a mean difference, a Pearson or Spearman correlation coefficient, or a regression coefficient. For instance, in a matched-pair or pre-post study, since the pairs are identical, i.e. the other variables are controlled to be constant in each pair, a mean difference or a standard correlation represents the degree of association. In the presence of one or several uncontrolled variables, we should include the variables in the analysis. For a simple case, we should use a regression coefficient. Alternatively, if we use a correlation coefficient, we should use a partial correlation coefficient.
To sum up, either a correlation coefficient or a regression coefficient can be used as a causality measure, as long as the direction and isolation requirements can be established. Note that there are many factors that invalidate a regression coefficient as a causality measure.
Details of this discussion can be found in Bollen (1989), "Structural Equations with Latent Variables".
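As a minimal sketch of how such a model might be written down in R (using the lavaan package; the data frame d and the variables y, x, c1 and c2 are hypothetical, and this is only one of many possible specifications):
library(lavaan)
model <- '
  y ~ x + c1 + c2   # y regressed on the presumed cause x while conditioning on controls c1 and c2
'
fit <- sem(model, data = d)
summary(fit, standardized = TRUE)
Whether the coefficient of x can be read causally still depends entirely on the direction and isolation arguments above, not on the software.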
Two points:
1. The application of the logical rules of Experimental Design, such as the presence of adequate control groups, the capacity to manipulate the independent variable, no regression-to-the-mean effect, no mortality in the "before-after" interval, etc.
2. Principles of logic: "the night follows the day, but one is not the cause of the other"; take care with tautologies... this is very common, e.g. "first job prestige is the cause of actual job prestige"... This is more a control variable than a causal one. Thus all the other variables introduced may be claimed as the causes that explain movements between the first job and the actual job.
Since the time of Aristotle there are 4 causes: material, formal, genetic and functional. The first two are more descriptive, while the last two are used more in the Social Sciences, the third being universal in the Natural Sciences, e.g. "heat expands bodies"... Weber, for instance, is more inclined to the functional one: "the Protestant ethic (values, goals, ends...) generated the spirit of capitalism". That's all...
What constitutes "proof" of causation depends on one's world view. I agree with previous comments since I was raised & trained in an empirical environment. For those whose world view is religion-based the only proof necessary is faith.
Western science is dominated by logical positivism, which has generated many different criteria for establishing proof of causation. (See Hume ... or Bacon if you're a radical empiricist.) The simplest set of criteria that I've run across is that A can be said to "cause" B if (1) A & B covary (association), (2) A precedes B, & (3) all other explanatory variables have been eliminated. You're right in questioning (3) ... but that's a core weakness of empiricism ... one negative result is sufficient to disprove a theoretical assumption ... an infinite number of associations is never enough proof. Witness the essentially infinite reductions that our "strongest" science, physics, seems to suffer. The inductive-statistical model of scientific investigation yields especially weak evidence of causation. Evidence of statistical association is necessary for causation, but woefully insufficient.
The standard procedure in econometrics is to test for Granger causality:
http://en.wikipedia.org/wiki/Granger_causality
for an applied example see e.g.
http://www.youtube.com/watch?v=q7ESJdhSBJk
However, in a more general sense causality might extend beyond the cause-effect approach; see e.g. quantum entanglement:
http://en.wikipedia.org/wiki/Quantum_entanglement
One way to "proove" causation is to exclude (all) other reasons for the correlation you have measured. The two most important are:
1) Reverse causation and
2) common cause (of alleged cause-variable and alleged effect-variable).
This is kicking a dead horse -- you can never "prove" that something is a cause. Finding the causes of effects is always a theory, subject to revision as you learn more. Whereas measuring the effect of a plausible cause is what scientists do as a routine transaction of their lives.
What follows is an extract from my 2014 book Medical Illuminations that illustrates this in the most iconic of epidemiological findings of a cause: John Snow's discovery of the cause of the London cholera epidemic of 1854:
Finding the cause of effects is a task of insuperable difficulty. As an example let us reach back into mid-nineteenth century London when a cholera epidemic was killing many of the inhabitants of St. James Parish. Dr. John Snow (1813-1858), a founding father of modern epidemiology, prepared his now famous map that illustrated how he was able to trace the cause of the epidemic to drinking from the Broad Street pump. He removed its handle and within days, the epidemic that had taken more than five hundred lives, sputtered to an end.
Figure 1.1 A map of the area near the Broad Street Pump, taken from John Snow’s 1855 book on the communication of Cholera. The shaded bars indicate the number of deaths at that location. The pump is at the center of all of the deaths.
Does John Snow’s success with this map support the case for making maps of unadjusted cancer data and some plausible causes public? Did it help Snow find the cause of the epidemic? What was the cause of the 1854 London epidemic? Was it the water drawn from the Broad Street pump? Or perhaps it was Frances Lewis’ feces that leaked out of a nearby cesspool? Frances Lewis was a five-month old child, who lived at 40 Broad Street and perished from cholera; she is widely considered to be the index case of that epidemic. Or was it the bacteria Vibrio cholerae in those feces? The Italian Filippo Pacini is now credited with being the first to identify this bacterium as the proximal cause of cholera in 1854; ironically the same year as the London epidemic. Is it really Vibrio cholerae? Or is it the enterotoxin that it generates? You get the idea. As we learn more our judgment of what is the ‘true cause’ keeps shifting. It is likely that at some time in the future research will reveal that it is some peculiar protein that interacts in an odd way to cause the disease. And even that is unlikely to be the end of it.
This is almost always the case when we try to find the cause of an effect. But measuring the effect of a cause is easier (although by no means easy). And more important, once measured, it is eternal. Although what John Snow determined was the cause of the 1854 cholera epidemic has shifted over time, the effect of drinking from the Broad Street Pump – the end of 570 lives – remains true.
Modern epistemology focuses on measuring the effects of causes. It is deemed more fruitful to set aside questions like “does smoking cause cancer?” in favor of the quantitative question, “how much is your risk of cancer increased if you smoke?”
If you are dealing with time series you have to look at Granger causality tests or look at cross correlations.
If not, look at the Bradford Hill criteria for causality. It may help.
Correlation implies but does not prove.
Granger causality is the best way to "prove" causality under a set of rigorous assumptions. Try not to use the word prove.
Get in touch with an epidemiologist and study Bradford Hill's criteria for causality, which were applied to establish the causation between smoking and lung cancer. However, based on correlation alone, you cannot establish causality. You have to conduct appropriate and feasible epidemiological studies to establish the different criteria stipulated by Bradford Hill.
A correlation coefficient describes the direction (positive or negative) and the strength of association between two variables. The higher the correlation coefficient, the stronger the association. Correlation may be measured through many techniques, such as Pearson, while causation may be proved by using the regression analysis technique. B and beta can show the effect of the independent variables on the dependent variable.
Causality can be proven from correlation if we have sufficient knowledge from the scientific field in question to enable us to rule out alternative explanations for the observed correlation. See for instance J.M. Robins' work especially "Causal Inference from Complex Longitudinal Data" http://link.springer.com/chapter/10.1007/978-1-4612-1842-5_4
Granger causality tests whether one event precedes another. For example, Christmas cards are highly correlated with, and precede Christmas. Common sense tells us that Christmas cards do not 'cause' Christmas, even though they 'Granger cause' Christmas! Please, always place the adjective 'Granger' before the word 'cause' if you apply the Granger test of causation to your data.
It is simple: only a controlled experiment can show causality.
I know that some experiments are inhumane and ethically impracticable:
We have information from correlation studies that smokers have a shorter life than nonsmokers. But it is only information that smoking and length of life are related. We automatically (and scientifically wrongly) suppose that smoking shortens life. But it is equally fair to assume that for some reason people are predisposed both to a shorter life and to smoking. Only a conclusive experiment would settle it: we would create a random set of 1,000 persons who would be made to smoke 40 cigarettes a day, and a set of 1,000 persons who demonstrably do not smoke. After 30 years, you would find out in which set more people had died. And of course, such an experiment cannot be done. We therefore have no strictly scientific way of knowing whether it is cigarettes that shorten life, or whether a short life brings dependence on tobacco....
Much of econometrics is devoted to methods for providing good evidence for causality in non-experimental data. If we are really sure that the explanatory variable in question in a regression is not caused by the dependent variable (e.g. sunspots are not caused by human activity on Earth), then we can treat correlation as causality. Now, in the case of sunspots, they may cause another variable, e.g. rainfall or temperature, which then affects agricultural production, which affects agricultural prices, so the variable may not be the direct cause unless we also control for the effects of all these other variables, but it is a deep cause.
Granger causality is a useful technique for providing such evidence if we can be reasonably sure we have included all other important potential causes in our model but don't have such strong evidence on the exogeneity of the variable we are interested in.
Then there are instrumental variable methods, which depend on finding supposedly exogenous variables that are correlated with the potentially endogenous variables, but there are really some quite tough conditions to find a valid instrumental variable.
I wrote a bit about this in my conference paper from a couple of years ago: From Correlation to Granger Causality. I was invited to speak on the topic of causality analysis:
http://ideas.repec.org/p/een/crwfrp/1113.html
With a designed experiment, in which you purposefully control and vary the values of the potential "causor", and measure the result on the "causee".
Again!! A statistically significant correlation coefficient does not "prove" causation. It states the likelihood that there is a relationship. This question should be stopped because this answer has been given many times before. Those who ask this question should start reading a statistics book.
Here's a list of really helpful correlations.. perhaps looking at it will help people get a better sense of correlation vs causation. Enjoy!
Spurious Correlations -> http://www.tylervigen.com
This is my own intuition of the problem:
1. “Proving” implies a deterministic system. Statistics cannot prove anything because, as a framework, it is predicated on a probabilistic system. This is why we have the error term.
2. Correlation is a necessary condition of cause and effect, but not a sufficient condition.
3. It follows that cause and effect create a correlation, but correlations are not by default causative.
4. Furthermore, correlation is only one metric of relationship between measured variables, and the corresponding test statistic and p-value reflects only the probability that the relationship occurred due to chance variation alone. Thus, correlation and the test statistic cannot be used to determine the nature of the relationship (in terms of cause), but rather only reflect evidence that a relationship exists (and estimate the strength of said relationship).
5. As a consequence, the ability to infer causality relies on the (experimental) design rather than the (statistical) analysis.
Good summary, Grayson. But I have a concern with one off-topic point:
"p-value reflects only the probability that the relationship occurred due to chance variation alone."
To my understanding (and I think I really tried hard to understand this!) "probability" and "chance" are synonyms, expressing some ignorance (lack of definite knowledge) about an observation or variable, so that we can only provide more or less confident expectations. If this is correct, the sentence I cited is a nonsensical circular statement. I mention this because I read such statements often in many statistics books and texts, and I think this is a source of major confusion about statistics, probability, p-values etc. that develops in the minds of many after relying on (and not further thinking about) such nonsensical statements.
I would be happy to hear some good explanations of why I am wrong in my thoughts.
To give a (in my opinion) correct definition of a p-value: the p-value is the probability, under the null hypothesis, of getting data that is at least as extreme as the actually observed data. (*) This is a very technical sentence, I know, and it won't be understood by many. So I understand that it is very tempting to "simplify" it or to use different words or ways to express it. However, I think (after hard and long thinking about this) that any change to this sentence will introduce some severe, deleterious misconception or misunderstanding that subsequently leads to these frequent misinterpretations of hypothesis tests and their abuse, and should therefore be avoided. I am of the opinion that it might be better to acknowledge that one does not really understand what it means, rather than to be sure of having understood a concept that is actually flawed and wrong (and then go on with misinterpretations and abuse)!
(*) You may exchange "probability" with "chance" or "expectation". This won't change the meaning, but it might make clearer what the sentence is about. If and only if our expectations about the observations correlate with the relative frequencies we would get from an ever-growing series of similar samples from the same population, then and only then can a p-value also be interpreted as a relative frequency in such a scenario, and this justifies Neymanian hypothesis tests and their control of error rates.
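The relative-frequency reading in (*) can be made concrete with a small simulation (a sketch with invented sample sizes and numbers): generate many data sets under the null, and the p-value of the observed statistic is approximated by the fraction of null data sets giving a statistic at least as extreme.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 30

# One "observed" sample (here generated with no true relationship)
x = rng.normal(size=n)
y = rng.normal(size=n)
r_obs = stats.pearsonr(x, y)[0]

# Simulate the null: independent samples of the same size, many times
r_null = np.array([
    stats.pearsonr(rng.normal(size=n), rng.normal(size=n))[0]
    for _ in range(10_000)
])

# p-value = relative frequency, under H0, of a correlation at least as extreme as observed
p_sim = np.mean(np.abs(r_null) >= np.abs(r_obs))
print(f"observed r = {r_obs:.2f}, simulated two-sided p = {p_sim:.3f}")
```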
Hi Jochen,
I appreciate all too well where you are coming from and agree that what is presented in textbooks is often unclear. I use the definition “the p-value is the probability of attaining the observed value or greater value, assuming the null, given infinite trials” when teaching; however, given this topic area and this venue, I attempted to use a more accessible definition.
If I understand your argument correctly, then I believe I should clarify that I distinguish between the concept of “the chance” and “chance”. Perhaps this reflects a semantic misunderstanding on my part, but I interpret “the chance” as a probability value and “chance” as denoting randomness. Thus, when I referred to “chance variation”, I meant the random variation assumed under the null, and by “probability” the probability value of attaining the observed value (or a greater value).
I agree the semantics can be confusing. One can have a high chance (probability) of encountering someone while also having a chance (random) encounter. My mentor and advisor always held that the phrase “random chance” was redundant; though he was a student of Neyman’s, I don’t know what Neyman’s own interpretation would have been. I welcome your thoughts.
Thank you for this enlightening answer, Grayson. It might then help to avoid some confusion if you used the word "probability" instead of the phrase "the chance". Denoting something as "chance" or "random" implies that there is some uncertainty or ignorance about the observations, and "probability" is a relative measure of our expectations for the different possible observations. Thus, something "happening by chance" does not at all mean that there are no rules, or that it is nondeterministic, or anything like this. The only thing we can reasonably express is that we are simply dealing with a variable for which we do not have all the information needed to make correct predictions (no matter what the physical or practical reasons for this ignorance are).
This is still not really clear to me: "One can have a high chance (probability) of encountering someone while also having a chance (random) encounter."
Often the example of flipping a coin is given to introduce the concept of a random variable. The explanation is about "repeating the experiment" and getting different results (sometimes heads, sometimes tails). This already omits the most important point of the story: the experiment is NOT repeated under exactly the same conditions. It is NOT a property of the coin flip itself that leads to different possible outcomes, but rather the fact that the experimental conditions vary. And since there is nothing miraculous in flipping a coin, knowing the exact conditions of the flip would tell us the definite result. Thus, by itself a coin flip is just NOT a random variable. The problem starts with the variability of the conditions from toss to toss, and with the fact that we do not know the kind and amount of this variation.
Further, usually the "fairness of the coin" is analyzed. Again, a mental misconception. It is not a property of the coin to be "fair" or not; it lies in the way the coin is tossed. Depending on the kind and size of the variation between tosses, any limiting relative frequency of heads can establish itself. The key is: when we have NO idea how the coin is tossed (and also have only insufficient information [or don't make use of theoretically available information] about the geometry of the coin, the air, and all other parts of the experimental setup), we have no reason to expect heads to show up more (or less) likely than tails. We then assign equal probabilities to each possible outcome. The argument that we do this because it is a property of the (fair) coin to produce a uniform frequency distribution of the outcomes is mentally and conceptually misleading. This is often confused with the fact that probability distributions are just practically useful in a very particular way when they have the same shape as the frequency distributions (this is the basis of Neyman's testing theory).
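A toy simulation can make this concrete (the physics is drastically simplified and all the numbers are invented): each "flip" below is a deterministic function of its initial conditions, with no randomness inside; the apparent 50/50 behaviour comes entirely from the variation in, and our ignorance of, those conditions.

```python
import numpy as np

rng = np.random.default_rng(5)

def flip(omega, t_air):
    """Deterministic toy coin: heads if the coin completes an even
    number of half-turns before landing. No randomness inside."""
    half_turns = int(omega * t_air / np.pi)
    return "H" if half_turns % 2 == 0 else "T"

# The variation (and our ignorance of it) is in the tossing conditions,
# not in the coin: spin rate and airtime differ slightly from toss to toss.
omegas = rng.uniform(150, 250, size=100_000)  # spin rate in rad/s
t_airs = rng.uniform(0.4, 0.6, size=100_000)  # time in the air, in seconds

outcomes = [flip(w, t) for w, t in zip(omegas, t_airs)]
print("fraction heads:", outcomes.count("H") / len(outcomes))  # close to 0.5
```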
The whole misconception becomes more evident if we do not flip a coin but instead ask for the side showing up of a coin that is just lying on the table. People confusing probabilities and frequencies cannot give an answer here, since the outcome is fixed a priori, there is nothing variable, and there is no possible replication of the "experiment". But we can still make a guess about the side showing up, as long as we do not know which side is up. This ignorance of ours makes the result "random" for us. Again, if we do not know anything further, we have no reason to find one side more likely than the other.
Suppose I tell you that a "zok" (do not try to guess what it is; it is just some unknown physical object) is lying on my table, that there are two ways it can lie, "A" and "B", just like head and tail of a coin, and I ask you for a guess about which side it may be lying on. A reasonable answer is the same as for the coin: you should expect A and B with equal probability.
You may try to learn something to improve your guess/expectation. You could ask many other people (ideally ones who are as similar to me as possible) whether they have a "zok" on their table and in what position it is lying. If most people tell you that it is on side "A", this is also a more reasonable guess for my particular case: you would reasonably give a higher probability to "A" than to "B".
But now suppose I tell you that this "zok" is a water bottle, and that the two positions are A: on its bottom (the bottle is "standing") or B: on its side (the bottle is "lying"). Now you know a great deal more, and therefore you have more particular expectations: it now seems much more likely that the bottle is standing.
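One way to make the updating in the "zok" story quantitative is a small Bayesian sketch (the prior and the survey counts are entirely invented): start from equal probabilities, then shift the expectation toward "A" as informative observations come in.

```python
def posterior_p_A(n_A, n):
    """Posterior mean probability that the zok lies on side "A",
    assuming a flat Beta(1, 1) prior over the long-run fraction of
    zoks on side "A" (Beta-Binomial conjugacy: posterior is
    Beta(1 + n_A, 1 + n - n_A))."""
    return (1 + n_A) / (2 + n)

print(posterior_p_A(0, 0))    # no survey information: 0.5, just like the unknown coin
print(posterior_p_A(18, 20))  # most people report "A": expect "A" with roughly 0.86
```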
Correlation is just one of the three criteria for inferring causality. The other two are (i) time order (the cause precedes the effect) and (ii) internal validity (elimination of plausible alternative explanations).
In one way or another, this point of view has already been brought into the discussion; I only wanted to formulate it more succinctly.
Thanks to those who have thought this over and enriched my understanding, in particular Dr. Jochen Wilhelm and Dr. Ehsan Khedive.
Statistical methods measure numerical association only. For serious guidance on causal evidence, see Austin Bradford Hill's excellent essay here: http://www.edwardtufte.com/tufte/hill