My understanding of statistical sampling error comes from counting statistics: in the Poisson (large-N) limit of the binomial distribution, the variance of a count, which is the square of the standard deviation, is simply N, the count itself. I believe that I know, or once knew, that the confidence limit is calculated from this distribution using an error function integral, Erf(x). In my physics career we always worked with samples that were small compared with the whole population, and one got used to calculating the statistical component of the total error as sigma = SQRT(N). Estimating systematic errors was where most of the error-analysis effort was spent.
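To make that concrete, here is the calculation I am used to, sketched in Python (the count is made up; the confidence levels come from the usual normal approximation via the error function):

```python
import math

# Physics-style counting statistics: for an observed count N, the
# Poisson (large-N binomial) approximation gives sigma = SQRT(N).
N = 400                       # hypothetical count
sigma = math.sqrt(N)          # statistical error on the count

# Confidence level for a +/- k*sigma interval under the normal
# approximation, via the error function: P(|x| < k*sigma) = Erf(k/SQRT(2)).
for k in (1, 2, 3):
    confidence = math.erf(k / math.sqrt(2))
    print(f"+/-{k} sigma = +/-{k * sigma:.0f} counts, "
          f"confidence = {confidence:.4f}")
```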
Now I am doing social science research where one has access, on some occasions, to entire populations: for example, the number of women graduating with Electrical Engineering degrees from U.S. institutions year by year. When I propagate the sampling errors through a calculation of the ratio of women graduates to all (women plus men) graduates, I get an enormous error, comparable to and even larger than the ratio itself.
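For reference, here is the propagation I am attempting, sketched in Python with made-up counts. Note that when the algebra is done correctly, the first-order propagated error collapses to the familiar binomial standard error SQRT(p(1-p)/T), which is small compared with the ratio:

```python
import math

# Hypothetical counts (illustration only, not the real data):
W = 2_000            # women graduating in a given year
M = 18_000           # men graduating in the same year
T = W + M            # all graduates

# Counting errors on the raw counts:
sigma_W = math.sqrt(W)
sigma_M = math.sqrt(M)

# First-order propagation through p = W / (W + M), assuming small,
# uncorrelated component errors:
#   dp/dW =  M / T**2
#   dp/dM = -W / T**2
p = W / T
sigma_p = math.sqrt((M / T**2 * sigma_W) ** 2 + (W / T**2 * sigma_M) ** 2)

# The algebra collapses to the binomial standard error SQRT(p(1-p)/T):
print(p, sigma_p, math.sqrt(p * (1 - p) / T))
# -> 0.1  0.00212...  0.00212...
```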
Clearly, I am doing something wrong. For example:
1.) Elementary calculus or algebraic errors (see attached)
2.) Violating assumption(s) of the simple propagation of errors scheme:
a.) the component errors are small compared with the measured quantities.
b.) the component errors are uncorrelated with one another.
3.) Applying sigma = SQRT(N) when my "sample" is the entire population
4.) Something else I remain blind to. This turned out to be it: my mistake was adding the central value to the error, so that the plotted error bars were (central value + error) rather than the error alone. Graphing the central value with just the error as the error bar works as expected; see the sketch below. (Edit made 27 August 2014.)
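As best I can reconstruct the mistake, in Python/matplotlib terms (the numbers are made up) it was the difference between these two calls:

```python
import matplotlib.pyplot as plt

# Made-up ratio series and propagated errors (illustration only):
years = [2010, 2011, 2012, 2013]
p     = [0.105, 0.110, 0.108, 0.115]
err   = [0.002, 0.002, 0.002, 0.002]

# Wrong (what I did): the error bar passed to the plot was the sum
# (central value + error), so every bar is comparable to the ratio itself.
plt.errorbar(years, p, yerr=[v + e for v, e in zip(p, err)],
             fmt='o', label='yerr = value + error (wrong)')

# Right: pass only the propagated error as the error bar.
plt.errorbar(years, p, yerr=err, fmt='s', label='yerr = error (right)')

plt.xlabel('Year')
plt.ylabel('Fraction of women among EE graduates')
plt.legend()
plt.show()
```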
For what it is worth, another physicist kindly checked (1.) and (2.) for me and said they were correct.
It seems to me that it should be possible to calculate the sampling error as the sample approaches 100% of the population, at which point the measured value is known with certainty. In this limit the sampling error should approach zero, smoothly, I would imagine.
If the sampling error is simply zero when one has the entire population, such a formula should tell me so. And if I had this formula, and understood how to derive it, the sampling error being zero in that case would be easier to accept.
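The finite population correction from survey sampling looks like a candidate: it multiplies the usual standard error by SQRT((N - n)/(N - 1)) for a sample of n drawn without replacement from a population of N, and it goes smoothly to zero as n approaches N. A minimal sketch, assuming this is the right object:

```python
import math

def sampling_error(p, n, N):
    """Standard error of a proportion p estimated from a sample of n
    drawn without replacement from a population of N, including the
    finite population correction SQRT((N - n) / (N - 1))."""
    se_infinite = math.sqrt(p * (1 - p) / n)   # usual SQRT(p(1-p)/n)
    fpc = math.sqrt((N - n) / (N - 1))         # finite population correction
    return se_infinite * fpc

# The error falls smoothly to zero as the sample approaches 100%
# of the population:
N = 20_000
for frac in (0.10, 0.50, 0.90, 0.99, 1.00):
    n = int(frac * N)
    print(f"n/N = {frac:4.2f} -> sampling error = {sampling_error(0.10, n, N):.6f}")
```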
It seems to me that there would still be systematic errors. But these are hard to estimate, particularly where self-reported data is aggregated nationwide.
It also seems to me that there may still be errors related to the population size, but I have no understanding or intuition for that, other than the fact that the data look "naked" to me without their error bars, and that to the naked eye, smaller populations (e.g. astronomy) appear to have more year-to-year statistical fluctuation than larger populations (e.g. biology).
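That intuition is at least consistent with counting statistics, where the relative fluctuation of a count of size N scales as 1/SQRT(N), so smaller fields should indeed bounce around more from year to year. A quick simulation with made-up annual means:

```python
import random
import statistics

random.seed(1)

# Made-up mean annual graduate counts for a small and a large field:
for label, mean in (("small field, ~100 graduates/yr", 100),
                    ("large field, ~10000 graduates/yr", 10_000)):
    # Simulate 20 years of counts with sigma = SQRT(mean), the normal
    # approximation to Poisson fluctuations (adequate at these means):
    counts = [random.gauss(mean, mean ** 0.5) for _ in range(20)]
    relative = statistics.stdev(counts) / statistics.mean(counts)
    print(f"{label}: relative year-to-year fluctuation ~ {relative:.3f}")
# Expect roughly 1/SQRT(100) = 0.10 and 1/SQRT(10000) = 0.01.
```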
If one of you could get me back on the statistical path, I would greatly appreciate that.
Thanks,
Mark Frautschi