My understanding of statistical sampling error comes from counting statistics: in the Poisson (large-N) limit of the binomial distribution, the variance of a count, which is the square of the standard deviation, is simply N, the count itself. I believe that I know, or once knew, that the confidence limit is calculated from this distribution using an error function integral, Erf(x). In my physics career we always worked with samples that were small compared with the whole population, and one got used to calculating the statistical component of the total error as sigma = SQRT(N). Estimating systematic errors was where most of the error-analysis effort was spent.
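To make that concrete, here is the calculation I am used to, sketched in Python (the count is made up; the confidence levels come from the usual normal approximation via the error function):

```python
import math

# Physics-style counting statistics: for an observed count N, the
# Poisson (large-N binomial) approximation gives sigma = SQRT(N).
N = 400                       # hypothetical count
sigma = math.sqrt(N)          # statistical error on the count

# Confidence level for a +/- k*sigma interval under the normal
# approximation, via the error function: P(|x| < k*sigma) = Erf(k/SQRT(2)).
for k in (1, 2, 3):
    confidence = math.erf(k / math.sqrt(2))
    print(f"+/-{k} sigma = +/-{k * sigma:.0f} counts, "
          f"confidence = {confidence:.4f}")
```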
Now I am doing social science research where one has access, on some occasions, to entire populations: for example, the number of women graduating with Electrical Engineering degrees from U.S. institutions year by year. When I propagate the sampling errors through a calculation of the ratio of women graduates to all (women plus men) graduates, I get an enormous error, comparable to and even larger than the ratio itself.
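For reference, here is the propagation I am attempting, sketched in Python with made-up counts. Note that when the algebra is done correctly, the first-order propagated error collapses to the familiar binomial standard error SQRT(p(1-p)/T), which is small compared with the ratio:

```python
import math

# Hypothetical counts (illustration only, not the real data):
W = 2_000            # women graduating in a given year
M = 18_000           # men graduating in the same year
T = W + M            # all graduates

# Counting errors on the raw counts:
sigma_W = math.sqrt(W)
sigma_M = math.sqrt(M)

# First-order propagation through p = W / (W + M), assuming small,
# uncorrelated component errors:
#   dp/dW =  M / T**2
#   dp/dM = -W / T**2
p = W / T
sigma_p = math.sqrt((M / T**2 * sigma_W) ** 2 + (W / T**2 * sigma_M) ** 2)

# The algebra collapses to the binomial standard error SQRT(p(1-p)/T):
print(p, sigma_p, math.sqrt(p * (1 - p) / T))
# -> 0.1  0.00212...  0.00212...
```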
Clearly, I am doing something wrong. For example:
1.) Elementary calculus or algebraic errors (see attached)
2.) Violating assumption(s) of the simple propagation of errors scheme:
a.) the component errors are small compared with the measured quantities.
b.) the component errors are uncorrelated with one another.
3.) Applying sigma = SQRT(N) when my "sample" is the entire population
4.) Something else I remain blind to. This turned out to be it: my mistake was adding the central value to the error, so that the plotted error bars were (central value + error) rather than the error alone. Graphing the central value with just the error as the error bar works as expected; see the sketch below. (Edit made 27 August 2014.)
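As best I can reconstruct the mistake, in Python/matplotlib terms (the numbers are made up) it was the difference between these two calls:

```python
import matplotlib.pyplot as plt

# Made-up ratio series and propagated errors (illustration only):
years = [2010, 2011, 2012, 2013]
p     = [0.105, 0.110, 0.108, 0.115]
err   = [0.002, 0.002, 0.002, 0.002]

# Wrong (what I did): the error bar passed to the plot was the sum
# (central value + error), so every bar is comparable to the ratio itself.
plt.errorbar(years, p, yerr=[v + e for v, e in zip(p, err)],
             fmt='o', label='yerr = value + error (wrong)')

# Right: pass only the propagated error as the error bar.
plt.errorbar(years, p, yerr=err, fmt='s', label='yerr = error (right)')

plt.xlabel('Year')
plt.ylabel('Fraction of women among EE graduates')
plt.legend()
plt.show()
```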
For what it is worth, another physicist kindly checked (1.) and (2.) for me and said they were correct.
It seems to me that it should be possible to calculate the sampling error as the sample approaches 100% of the population, at which point the measured value is known with certainty. In this limit the sampling error should approach zero, smoothly, I would imagine.
If the sampling error is simply zero when one has the entire population, such a formula should tell me so. And if I had this formula, and understood how to derive it, the sampling error being zero in that case would be easier to accept.
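The finite population correction from survey sampling looks like a candidate: it multiplies the usual standard error by SQRT((N - n)/(N - 1)) for a sample of n drawn without replacement from a population of N, and it goes smoothly to zero as n approaches N. A minimal sketch, assuming this is the right object:

```python
import math

def sampling_error(p, n, N):
    """Standard error of a proportion p estimated from a sample of n
    drawn without replacement from a population of N, including the
    finite population correction SQRT((N - n) / (N - 1))."""
    se_infinite = math.sqrt(p * (1 - p) / n)   # usual SQRT(p(1-p)/n)
    fpc = math.sqrt((N - n) / (N - 1))         # finite population correction
    return se_infinite * fpc

# The error falls smoothly to zero as the sample approaches 100%
# of the population:
N = 20_000
for frac in (0.10, 0.50, 0.90, 0.99, 1.00):
    n = int(frac * N)
    print(f"n/N = {frac:4.2f} -> sampling error = {sampling_error(0.10, n, N):.6f}")
```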
It seems to me that there would still be systematic errors. But these are hard to estimate, particularly where self-reported data is aggregated nationwide.
It also seems to me that there may still be errors related to the population size, but I have no understanding or intuition for that, other than the fact that the data look "naked" to me without their error bars, and that to the naked eye, smaller populations (e.g. astronomy) appear to have more year-to-year statistical fluctuation than larger populations (e.g. biology).
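That intuition is at least consistent with counting statistics, where the relative fluctuation of a count of size N scales as 1/SQRT(N), so smaller fields should indeed bounce around more from year to year. A quick simulation with made-up annual means:

```python
import random
import statistics

random.seed(1)

# Made-up mean annual graduate counts for a small and a large field:
for label, mean in (("small field, ~100 graduates/yr", 100),
                    ("large field, ~10000 graduates/yr", 10_000)):
    # Simulate 20 years of counts with sigma = SQRT(mean), the normal
    # approximation to Poisson fluctuations (adequate at these means):
    counts = [random.gauss(mean, mean ** 0.5) for _ in range(20)]
    relative = statistics.stdev(counts) / statistics.mean(counts)
    print(f"{label}: relative year-to-year fluctuation ~ {relative:.3f}")
# Expect roughly 1/SQRT(100) = 0.10 and 1/SQRT(10000) = 0.01.
```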
If one of you could get me back on the statistical path, I would greatly appreciate that.
Thanks,
Mark Frautschi