Can interdependence of data points be identified before or after data analysis, or not at all? How does the notion of 'independence of data points' influence the organisation and analysis of your data sets?
Independence refers to our knowledge about influences that are common to (a subset of) the data (sometimes called a "common source of variance"). Independence can therefore only be "assured" by an appropriate design of experiments and by using an adequate model (accounting [more or less] correctly for all known common sources of variance). In this sense, independence is actually a property of the residuals and not of the data.
A simple example: the effect of a drug against high blood pressure is analyzed. The model accounts for the groups (treated and untreated), and beyond this the data (i.e. the residuals) are independent. However, if we know that several measurements were taken from the same individual, these values cannot be treated as independent anymore, because the residuals of the values from one individual will be closer to each other than to the residuals of the other values. This can be accounted for by the model ("repeated measures") so that the remaining residuals will again be independent.
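A small simulation can make the repeated-measures point concrete. This is only a sketch under assumed numbers: the group means, the spread between subjects (`subject_sd`) and the measurement noise (`noise_sd`) are hypothetical, and the function names are made up for illustration. It fits a group-only model and then compares residuals from the same subject with residuals from different subjects.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_residuals(n_subjects=200, subject_sd=8.0, noise_sd=3.0, seed=1):
    """Two readings per subject; returns residuals from a group-only model."""
    rng = random.Random(seed)
    readings = []  # (group, first reading, second reading)
    for s in range(n_subjects):
        group = s % 2                      # 0 = untreated, 1 = treated
        baseline = 150.0 - 10.0 * group    # hypothetical group means
        subject_effect = rng.gauss(0.0, subject_sd)  # shared by both readings
        first = baseline + subject_effect + rng.gauss(0.0, noise_sd)
        second = baseline + subject_effect + rng.gauss(0.0, noise_sd)
        readings.append((group, first, second))
    # Group-only model: the fitted value is the mean of all readings in a group.
    group_mean = {}
    for g in (0, 1):
        values = [v for grp, a, b in readings if grp == g for v in (a, b)]
        group_mean[g] = sum(values) / len(values)
    first_res = [a - group_mean[g] for g, a, b in readings]
    second_res = [b - group_mean[g] for g, a, b in readings]
    return first_res, second_res

first_res, second_res = simulate_residuals()
# Residuals of the two readings taken from the same subject.
same_subject = pearson(first_res, second_res)
# First residual of each subject paired with the second residual of the next subject.
other_subject = pearson(first_res, second_res[1:] + second_res[:1])
print(f"same subject:      r = {same_subject:.2f}")
print(f"different subject: r = {other_subject:.2f}")
```

Under these assumptions the same-subject residuals should be strongly correlated while the cross-subject ones should not, which is the pattern a repeated-measures term in the model is meant to absorb.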
But a lack of independence is not restricted to repeated measures. Any knowledge about a common source of variance destroys the independence. For instance, if we knew that the drug has a different effect in males and in females, the residuals will not be independent (we know that residuals from males will be closer together than to the residuals of the females). This can be resolved when the model accounts for the sex-effect.
Let's take the example of the individual. Many phenotypic traits are not closely correlated within individuals (e.g. blood profile and body length). Does this imply that the definition of what is independent or not will simply depend on the (e.g. biology-based) background knowledge available?
The story gets a little complicated once the data are gathered and analyzed. If one then recognizes patterns in the distribution of the residuals, one has learned something, namely that the model lacks some obviously important predictor. However, since we do not know which predictor is missing (i.e. why the pattern shows up), the residuals are still considered "independent".
There is another dimension of "independence", one of very current interest (see link): measuring the same thing several times, but using the individual (repeated) measurements as if they were taken from different individuals. Again, these values taken from the same thing (the same gene in the example) have residuals that are much closer to each other than to those of other things (genes). This means that the variability between things (genes) will be underestimated. Here, the design of the experiment (using several values from the same "thing") very clearly destroys the independence, and there cannot be any different opinion (or a different state of knowledge) about this.
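One way to see the consequence is to compare the standard error of an overall mean computed two ways: pooling every repeated value as if it were a separate thing, versus first averaging within each thing. The sketch below uses simulated data with hypothetical settings (`n_things`, `n_repeats`, `between_sd`, `within_sd`); it illustrates the general effect and is not a reconstruction of the linked example.

```python
import random
import statistics

def standard_errors(n_things=20, n_repeats=10, between_sd=2.0,
                    within_sd=0.5, seed=2):
    """Return (naive_se, honest_se) for the overall mean of simulated data."""
    rng = random.Random(seed)
    per_thing = []
    for _ in range(n_things):
        true_level = rng.gauss(10.0, between_sd)  # level of this thing (e.g. gene)
        per_thing.append(
            [rng.gauss(true_level, within_sd) for _ in range(n_repeats)]
        )
    # Naive: treat all n_things * n_repeats values as independent observations.
    pooled = [v for values in per_thing for v in values]
    naive_se = statistics.stdev(pooled) / len(pooled) ** 0.5
    # Honest: one averaged value per thing, so the sample size is n_things.
    thing_means = [statistics.mean(values) for values in per_thing]
    honest_se = statistics.stdev(thing_means) / len(thing_means) ** 0.5
    return naive_se, honest_se

naive_se, honest_se = standard_errors()
print(f"naive SE  (repeats treated as individuals): {naive_se:.3f}")
print(f"honest SE (one mean per thing):             {honest_se:.3f}")
```

The naive standard error comes out smaller because the repeats add sample size without adding independent information about the differences between things.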
Marcel's argument has some weight and cannot be answered vaguely. In fact, when we say data or data points are independent, we mean that the data for different subjects do not depend on each other. When we say a variable is independent we mean that it does not depend on another variable for the same subject.
The former kind of independence may be tested using serial or spatial autocorrelation, and the latter using a chi-square test, as suggested in the link below:
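As a rough illustration of the two checks named here (not taken from the linked page), the sketch below computes a lag-1 serial autocorrelation for an ordered series and a Pearson chi-square test of independence for a 2x2 table. The function names and the example numbers are made up for illustration; spatial autocorrelation would need a distance-based statistic that is not shown.

```python
import math

def lag1_autocorrelation(series):
    """Lag-1 serial autocorrelation of an ordered series of values."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    numerator = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). With 1 degree of freedom the upper tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# A steadily rising series: neighbouring values resemble each other.
trend_r1 = lag1_autocorrelation([1, 2, 3, 4, 5, 6, 7, 8])
# Two categorical variables cross-tabulated for the same subjects.
stat, p_value = chi_square_2x2([[30, 10], [10, 30]])
print(f"lag-1 autocorrelation: {trend_r1:.3f}")
print(f"chi-square = {stat:.1f}, p = {p_value:.2g}")
```

A lag-1 autocorrelation far from zero suggests that successive data points depend on each other; a small chi-square p-value suggests that the two variables are not independent within subjects.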
First of all, we have to accept that something like Granger non-causality exists in order to accept that one data point is independent of another, so that we can treat the two points as a sample generated from "independently distributed random variables" (the variables need not always be identically distributed).
If we work with a dataset of V variables and N cases (the latter representing the receptor population) and apply a method that correlates them and builds their possible graphs, the method does not care whether the variables are independent or not; it simply creates models that force correlations among them. If they are independent, there is no need to correlate them, and if the researcher has good reasons to believe they are dependent, then there are good reasons to correlate them using a good method to model the answer. I guess that the main problem is not to ask about independence but to be somewhat sure that the V variables, and the N cases, are related to the main variable of interest. The concept of independence may be used reasonably with probabilistic games of chance like dice, but not with well-designed samples made of related variables selected from the best available experience about their possible mutual links, sequential presence, and possible causal participation. Good question that brings different views. emilio
You have touched on a very interesting topic. In my view, even when data on independent variables are collected, the role of the "operator" matters. Any "independence" will come down to who collected the data and how. I think many people have run into this in scientific research, and so have I. Let me explain. I was collecting data on phenotypic variability in frogs. Everything was done using one and the same method. I also trained my students to collect material using this method. Naturally, I also used data from other researchers that had been collected with the same method. The result: I ran a cluster analysis and, surprisingly, all the data on phenotypic variability in frogs collected by me and my students fell into one cluster, while the data from other researchers fell into another cluster. It thus became obvious that the "operator" matters when data are collected.
Thanks for these answers! They are helpful to a question I am trying to get my head around, but I am still a bit unsure:
I am trying to find out whether connectivity between habitat patches influences the butterfly populations found in the patches, so I will be collecting data on what butterfly species are found in each patch, along with a few other known and measurable explanatory variables (e.g. food plants and grass height), and analysing whether distance to nearest occupied patch influences the presence/absence of butterfly species on a patch, above and beyond the other known variables.
Presumably my samples are inherently not independent if the hypothesis that I am testing is correct, because then the presence of a butterfly species in one sample (i.e. patch) would influence whether it was present in the next sample (patch).
Is this OK, because distance to nearest occupied patch will be accounted for as a variable in the analysis, and is essentially the "treatment" in this case? Or is it a bit more complicated, because it is a rather complex continuous variable across all patches and not a categorical variable such as "medicine" vs "placebo" or "male" vs "female"?
All explanations of independence vs dependence I have found so far have categorical input variables and continuous output variables, but mine are the other way around!
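For the reversed situation described here (a continuous input and a presence/absence output), one common choice is logistic regression. The sketch below is a minimal illustration on simulated data, not a recommendation for the actual analysis: the variable names, the true coefficients and the sample size are all hypothetical, it uses a single predictor, and it ignores the other covariates and any spatial dependence between patches.

```python
import math
import random

def fit_logistic(xs, ys, n_iter=25):
    """Newton-Raphson fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood, intercept
            g1 += (y - p) * x      # gradient of the log-likelihood, slope
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated patches: presence becomes less likely with distance (assumed effect).
rng = random.Random(3)
distance_km = [rng.uniform(0.0, 5.0) for _ in range(500)]
present = [
    1 if rng.random() < 1.0 / (1.0 + math.exp(-(2.0 - 1.0 * d))) else 0
    for d in distance_km
]

intercept, slope = fit_logistic(distance_km, present)
print(f"intercept = {intercept:.2f}, slope per km = {slope:.2f}")
```

A negative slope means the estimated probability of presence falls as the distance to the nearest occupied patch grows; further explanatory variables such as food plants or grass height would enter as additional terms in the linear predictor.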
This potential dependence is what you want to model, so "that's not a bug - it's a feature!" ;)
This is the same as the "dependence" of the measured values on the "medicine" vs. "placebo" group, or the dependence between Y and X in a regression.
It would be a "problem", for instance, if you knew that the abundances of species A and B are correlated and you did not model this appropriately. Then the abundance values for A would not be independent of those for B. If your model does not consider this, the results may be "suboptimal", as the model would not use all available information appropriately.
Independence has another characteristic: an independent variable is assumed to be measured without error, whereas a dependent variable carries error, due either to random processes or to processes or variables that are not accounted for. For example, the butterflies on a patch of grass or flowers need not all be of the same kind; other kinds may be there for reasons not recorded in your data. Of course, if you have already collected data points for several variables, one of which is dependent and the others independent, you can test the independence of data points using any of the following methods:
If I am understanding your analysis correctly, your unit of analysis is the patch, about which you are measuring various characteristics including your outcome - presence or absence of species Y? This problem sounds like a good fit for network analysis. Network analysis lets you test the hypotheses you have about networks of patches (including whether or not they are a network), using characteristics such as distance as predictors of the network. My experience with it mostly deals with social network analysis, but if you replace people with patches I think you can make a functional translation. See this review piece, and see if you can find similar conversations about network analysis in your field.