To help ground my arguments, consider that I have all the deaths in a country, broken down by every sub-area, and policy makers are thinking of targeting funding at places with high rates of mortality.
A major challenge of using such data is the greater importance of natural, stochastic or chance variation. It is worth discussing what we mean by this as there is a lot of confusion on the topic. Gorard (2013), for instance, strongly argues that a modelling and inferential approach is not needed when dealing with populations, as we have here:
Gorard (2013, 54) “all traditional statistical analysis, including all tests of significance, and the use of standard errors and confidence intervals… are, of course irrelevant when the full population of cases is used since then there is no sampling variation”.
Gorard, S. (2013). Research Design: Robust approaches for the social sciences. London: SAGE.
For him inference should be confined to inferring from imprecise samples to true, but unknown population values. Consequently, we do not need an inferential framework here because we have all the areas and all the deaths.
I disagree. For me the imprecision arises not as a result of sampling variation but because of natural variation. The number of observed deaths is considered to be the outcome of a stochastic process which could produce different results under the same circumstances. It is this underlying process or risk that is of interest, and the actual observed values give only an imprecise estimate of it. Moreover, the smaller the number of deaths, the more imprecise the estimate of risk. If the expected number of deaths is 1 and the observed number is 1, the relative risk (observed divided by expected) is 1 and the area has the same risk as the country as a whole. However, if the observed deaths were recorded as 0, the risk would be zero and this would be the best place in the country. If the observed count is 2, the risk would be double the national rate, and this would be a high-risk area. Very small and highly plausible changes in the observed outcome would have a huge impact on the risk, and the risk would be estimated very unreliably. We need to be able to say with confidence that this risk is high and not likely to be due to chance.
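To make the point concrete, here is a minimal sketch (my own illustration in Python with scipy, not part of the original argument) of how easily a relative risk based on an expected count of 1 is moved by chance, assuming deaths in an area follow a Poisson process:

```python
# Minimal sketch: relative risk from tiny counts, assuming a Poisson process.
from scipy.stats import poisson

expected = 1.0  # expected deaths if the area matched the national rate

for observed in (0, 1, 2):
    relative_risk = observed / expected
    # Probability of seeing a count at least this large purely by chance
    # when the true risk equals the national rate.
    p_at_least = poisson.sf(observed - 1, expected)  # P(X >= observed)
    print(f"observed={observed}: relative risk={relative_risk:.1f}, "
          f"P(X >= {observed} | expected=1) = {p_at_least:.2f}")
```

Even the "doubled" risk (observed = 2) would arise roughly a quarter of the time purely by chance when the true risk matches the national rate, which is exactly why the point estimate alone cannot be trusted.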
The finer the grain – be it in terms of space, time, age or cause – the more the estimates of risk will be based on fewer and fewer events, thereby exacerbating the problem. We must surely guard against drawing conclusions and making policy recommendations that reflect merely chance outcomes. The danger is that any 'signal' in the data is lost in random noise, and little confidence can be placed in the results. There is a real danger that, in the zeal to have timely and detailed targeted funding, year-on-year oscillation in the amount of money allocated reflects random variation in the number of deaths rather than genuine underlying differences between areas.
So I want to use inferential procedures to guard against making hasty judgements when there is a lot of natural variation even where there are populations involved. The key is the importance of the stochastic element.
Moreover, we could dispense with notions of samples and populations altogether and take a Bayesian approach: treat the observed data as evidence and consider the degree of support for the estimate of risk, assessing the extent to which there is credible evidence that the risk is elevated.
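As a hypothetical sketch of what that could look like, here is a conjugate Gamma-Poisson model; the weakly informative prior and the expected count of 1 are assumptions of mine, not part of the post:

```python
# Sketch of a Bayesian view: relative risk theta is uncertain, with
# observed deaths y ~ Poisson(theta * expected) and theta ~ Gamma(1, 1) a priori.
# The conjugate posterior is Gamma(prior_shape + y, prior_rate + expected).
from scipy.stats import gamma

prior_shape, prior_rate = 1.0, 1.0   # assumed weakly informative prior
expected = 1.0                        # expected deaths at the national rate

for observed in (0, 1, 2):
    post_shape = prior_shape + observed
    post_rate = prior_rate + expected
    # scipy parameterises the gamma by shape and scale = 1/rate
    prob_elevated = gamma.sf(1.0, post_shape, scale=1.0 / post_rate)
    print(f"observed={observed}: P(relative risk > 1 | data) = {prob_elevated:.2f}")
```

Even when the observed count is double the expected, the posterior probability that the risk is genuinely elevated is only about two-thirds under this prior, which is the kind of "degree of support" statement meant above.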
The concept of inferential statistics is drawing conclusions about a population from sample information. If you are working with the whole population, it means that you may not need a statistical test.
You are making an assessment of a population's parameters from the sample. As Aishah says, if you have the population you do not need to make inferences about it but simply measure it. The significance is either present or not. In that case you are dealing with descriptive statistics.
To add a different perspective: I would argue that often you are not interested in 'the population' in the sense of (for example) all the people currently alive in a certain country, but in a 'superpopulation', that is, an infinite population of all the people that could have lived given the processes that shape the population. As such, you can infer to the superpopulation based on your sample (which may be the whole population) with inferential statistics.
For example, suppose you were interested in the health of people born in Antarctica and living in the UK. Chances are there aren't many people identifying as this. With descriptive statistics, you may find that, taking the whole population of the UK, the Antarctican population are on average more healthy than everyone else. But if there were only, say, 2 Antarcticans in the population, you would not say that effect was significant, in the sense of uncovering a real, meaningful underlying process, just because you have the whole population in your dataset. However, by thinking of the population as a sample from the superpopulation, we can then see the significance (or lack thereof) of the differences between Antarcticans and non-Antarcticans, and you need a statistical test to do this.
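A toy sketch of that idea (the health scores and group sizes below are made up for illustration; the test is a standard Welch t-test from scipy):

```python
# Toy example: two "Antarcticans" versus the rest of the "whole population".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
others = rng.normal(loc=70, scale=10, size=1000)   # health scores, rest of UK
antarcticans = np.array([76.0, 84.0])              # only two cases

print("mean difference:", antarcticans.mean() - others.mean())
t_stat, p_value = ttest_ind(antarcticans, others, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.2f}")
```

Even though the two Antarcticans score higher on average, the p-value comes out well above any conventional threshold: with only two cases the gap cannot be distinguished from chance, which is the sense in which the "whole population" still needs an inferential treatment.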
Dear Eddie, let me give a logical and a mathematical perspective on inference.
Logical: We use inferential statistics when we are not CERTAIN about the population parameter, and we have to ESTIMATE it with a STATISTIC. So when we have the whole population we CALCULATE the parameter, and do not ESTIMATE it. In other words, there is no error in the parameter calculation, so this is not inferential statistics; indeed, it is not statistics at all. It is mathematics.
Mathematical: In classic statistics we know that the parameters of a population are fixed values, and so is the standard deviation (SD) of the population. Besides, we regard the population as effectively infinite. Therefore, the standard error of the estimate (SE) behaves as:
SE = SD/√n → 0 as n → ∞
This formula says that as the sample size grows towards the population size, the standard error of the estimate goes toward zero.
All of the statistical tests are based on estimation error, and when we have no error, we get the exact value and not an inferred value.
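A small numerical sketch of that argument (my own illustration; the SD and population size are arbitrary), showing the usual SE = SD/√n alongside the version with the finite population correction, which is exactly zero when the sample is the whole population:

```python
# Standard error shrinking as the sample approaches the population.
import math

SD = 10.0      # population standard deviation (assumed known)
N = 10_000     # finite population size

for n in (10, 100, 1_000, 10_000):
    se_infinite = SD / math.sqrt(n)
    fpc = math.sqrt((N - n) / (N - 1))   # finite population correction
    se_finite = se_infinite * fpc
    print(f"n={n:>6}: SE (infinite pop) = {se_infinite:6.3f}, "
          f"SE (finite pop of {N}) = {se_finite:6.3f}")
```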
You can apply your test results to the whole population (that's the point of statistical inference), provided you are aware of Type I and Type II errors.
Friends, thank you for the valuable information shared.
May I also share this information I was able to find:
"Probability sampling is the only one general approach that allows the researcher to use the principles of statistical inference to generalize from the sample to the population (Frankfort- Nachmias & Leon-Guerrero 2002), its characteristic is fundamental to the study of inferential statistics (Davis, Utts & Simon, 2002). It is the sampling technique that uses the probability theory to calculate the likelihood of selecting a particular sample and allows the drawing of conclusions about the population from the sample.(Pelosi, Sandifer, & Sekaram, 2001) and it has the advantage of projecting the sample survey results to the population (McDaniel & Gates, 2002). Inferential statistical analyses are based on the assumption that the samples being analyzed are probability samples (Burns & Grove 1997)."
It's not so much that only randomisation 'allows' us to use significance tests etc. (because only then is there a standard error). As I have said before, that would be like saying we eat so that we can then wash up the plates. If we have randomisation, then we might want to assess the chance that the result we have arose from the random sampling. If we do not have randomisation, then that is not an issue. If we do not eat, we have no plates to wash!
The idea that the one population we might deal with is somehow a random example of an infinite number of possible imaginary populations strikes me as absurd. Perhaps it reflects a desperation to retain favoured techniques that are not fully understood. For more, see what Berk and Freedman say in:
Inferential statistics, unlike descriptive statistics, is the practice of applying the conclusions obtained from one experimental study to more general populations. This means inferential statistics tries to answer questions about populations and samples that have never been tested in the given experiment.
Inferential statistics infer from the sample to the population: they determine the probability of characteristics of a population based on the characteristics of your sample, and they also help assess the strength of the relationship between your independent variables and your dependent (effect) variables. With inferential statistics, you can take the data from samples and make generalizations about a population.
There are two main areas of inferential statistics: estimating parameters, which means taking a statistic from your sample data and using it to say something about a population parameter, and hypothesis tests, where you use sample data to answer research questions.
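As a brief illustration of those two areas (with made-up data; the comparison value of 100 is just an example), using Python and scipy:

```python
# (1) estimate a population parameter with a confidence interval,
# (2) run a hypothesis test on the same sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=102, scale=15, size=50)   # hypothetical sample

# 1. Parameter estimation: 95% confidence interval for the population mean
mean = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"estimated mean = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")

# 2. Hypothesis test: is the population mean different from 100?
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```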
Yes Sudeep. If these approaches worked, these are the two situations in which the maths allows their use: with full population data they are not needed, and with any non-randomised cases they can never be used. It's really quite simple. So ALL the real-life examples I have ever seen are incorrect.