Hi Everyone,
To be honest, I have never heard of or used this function in SPSS. However, some fellow researchers told me that it is often used and that it is encouraged by scientific publishers as well. Although I am not planning on using it, I wonder how ethical it really is to replace missing values in SPSS. Does anyone have experience with it or have used it? I am just interested in the ethical issues with it (if there are any).
Thank you
Dora -
There are many kinds of missing data problems. What they all have in common is that you need to account for the missing data based on the data you already have. My experience with this is with continuous data for establishment surveys, but I think the common thread is that one needs to consider all relevant data that you do have.
In survey statistics, we may need to account for missing data or else an estimate of a total will be biased downward. I have heard someone in such a situation say that their survey did not impute (or reweight) data for missing values. Well, actually that means that they really were imputing numbers: zeroes. That's no bueno!
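That downward bias from implicitly imputing zeroes, versus a simple reweighting adjustment, can be sketched with made-up numbers (these are not any real survey's figures, and the weights are invented for illustration):

```python
# Toy illustration: estimating a population total when some sampled
# units fail to respond. Treating nonrespondents as zero biases the
# total downward; inflating the respondents' weights does not
# (assuming respondents resemble nonrespondents).

responses = [120.0, 95.0, 150.0, 110.0]  # observed values from 4 respondents
n_sampled = 6                            # two sampled units did not respond
weight = 10.0                            # each sampled unit represents 10 population units

# "We did not impute": the two nonrespondents implicitly contribute zero.
total_with_implicit_zeros = weight * sum(responses)   # 4750.0

# Reweighting: inflate the weight by n / r so the respondents also
# stand in for the nonrespondents.
adjusted_weight = weight * n_sampled / len(responses)
total_reweighted = adjusted_weight * sum(responses)   # 7125.0

print(total_with_implicit_zeros, total_reweighted)
```

The gap between the two totals is exactly the bias Jim describes: "not imputing" is really imputing zeroes.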
In a time series, a missing value might be replaced, depending upon the method, by something close to what you might consider an interpolation or an extrapolation, which may not be very reliable; but it might also be estimated by use of additional correlated information (regressors) available from other sources.
Many people use multiple imputation now, as a way of using simulation to reflect variance. Imputing a mean value, for example, will artificially lower the variance of the dataset. In addition, it may be biased if there is a nonignorable mechanism, such that the missing value is representative of data not like the mean of the other data. There are many ways to look into that: response propensity groups and other stratifications, and various models. For a cross-sectional survey, regression model-based prediction can be used, which makes good use of other data you may already have, and you can estimate the variance of the prediction error. The square root of that, for an individual item nonresponse, is STDI in SAS PROC REG.
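The variance-shrinking effect of mean imputation is easy to demonstrate with a toy example (invented values, not from any real dataset):

```python
# Sketch: filling missing values with the observed mean leaves the mean
# unchanged but shrinks the sample variance, which is one reason single
# mean imputation understates uncertainty.
from statistics import mean, variance

observed = [4.0, 7.0, 9.0, 12.0]   # values we actually collected
m = mean(observed)                 # 8.0

# Two missing cases replaced by the observed mean.
completed = observed + [m, m]

print(mean(completed))                          # still 8.0
print(variance(observed), variance(completed))  # variance shrinks
```

The completed dataset looks less variable than the data ever were, which is exactly what multiple imputation is designed to avoid.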
Speaking of software, I have not used SPSS in many years, and do not know if the utility, or whatever it's called, to which you are referring is for a given experimental dataset, a time series, a survey, or what, but it all boils down to using the data that you do have efficiently, and always remembering that an imputed value is a kind of estimate, and not to be confused with a real observation. Your real sample size does not include imputed values.
Think of a sample from a finite population. The population size is N. The sample size is n. Is it ethical to estimate for the N - n values you do not have? Yes it is. In fact, a sample (with sampling error and nonsampling error) can often be more accurate than if you did a census (with just nonsampling error, such as measurement error), because straining resources to collect everything can get very messy. For an establishment survey, the smallest members of the population often have a very hard time providing reliable information on a frequent basis, so official statistics can be collected frequently as a sample, and less frequently as a census. This gives a relationship to predict for missing data in the sample surveys.
Anyway, one can think of the N-n cases 'out-of-sample' as being "missing," in addition to any nonresponse. Also note that when survey weights are used, they can be inflated (called "reweighting") to account for nonresponse, rather than using either single or multiple imputation. If a sample size is n, but there are two nonresponses (or data removed in editing) then the real sample size is n - 2.
The US Census Bureau often finds it necessary to sample. Further, when they have had missing data, they have used a variety of techniques (such as "nearest neighbor," "hot deck," etc.). But one needs to account for variance, and I heard a very interesting talk at the International Conference on Survey Nonresponse (ICSN) in Portland, Oregon, USA, in 1999, which resulted in this book chapter:
Lee, H., Rancourt, E., & Särndal, C. E. (2002), "Variance estimation from survey data under single imputation," Survey Nonresponse, 315-328.
(Book from invited sessions of the International Conference on Survey Nonresponse (ICSN), 1999.)
- Plus note that there are other papers by these authors on this topic or similar.
I have noted one person in another context, as I recall, referring to imputation as "feeling like cheating," but like the person who told me that they did not impute, someone who says this does not, in my opinion, appear to understand the ramifications of not accounting for missing data. However, you always need to remember that, just as in sampling, a value that was not collected/observed is an estimate, not a real collected value. It generally cannot be used to help estimate other data. It is sort of a "taker," not a "giver": it relies on other data you do have, and is not useful in estimating for other missing data.
Estimating for missing data is very important. Yet it leads to a funny story I heard Bob Groves (famous for the topic) say at his keynote for a conference (maybe the ICSN). He said his young boy (at the time) asked him what he did. He said he thought he was doing well, explaining this to his son, but I think he said he then noticed his son looked puzzled and said something like "So ... you study ... nothing." (This got a big laugh.) :-)
Cheers - Jim
Hello Dora. There is now quite a large literature on missing data. See below for a few articles I've found useful in becoming familiar with the basic issues. HTH.
http://onlinelibrary.wiley.com/doi/10.1111/j.1741-3737.2005.00191.x/pdf
http://www.nyu.edu/classes/shrout/G89-2247/Schafer&Graham2002.pdf
http://www.stats.ox.ac.uk/~snijders/Graham2009.pdf
Hi Dora,
I agree completely with Bruce. There is a vast amount of literature on missing data and it is mostly recommended to use multiple imputation or maximum likelihood over listwise and pairwise deletion. Attached please find some more articles/chapters about this topic. If you have any further questions please let me know.
Best regards,
Felix
Dear Dora, in order to judge whether such a method is ethical or not, the following question has to be answered:
Dear Demetris,
that is a very good point. A decent researcher not only mentions the imputation method, but also the number of values imputed, performs two data analyses (one with and one without missing data imputation), and presents both sets of outcomes in the results section.
Best, Felix
Felix - That is often true, but in the case of, say, survey sampling for continuous data, there is no estimating a total from a finite population without some imputation or reweighting for missing data - unless you impute zeroes in one case and use reasonable imputation in another, just to see how much is imputed. - Jim
My advice is to avoid replacing missing values if you can, because you are introducing an unknown amount of bias; it could lead to apparently more precise statistics, but things could also go the opposite way. So, if you have a small number of missing values, I would exclude those units with the missing values. In all other cases I would consider using one of the missing-value estimation methods.
Dear All,
If you compare my global temperature series for the Northern Hemisphere (1800-2013), which I found by using only complete monthly data (thus dropping the incomplete data), with results based on imputed data, you will find significant differences.
So, deciding for impute or not impute is critical.
Personally I have decided to use only complete data.
My regards to all.
http://www.sciencedirect.com/science/article/pii/S1364682615000577
What is the rationale for replacing missing data with some other data?
I would say that whatever data we have, we should RESPECT them and hence refrain from manipulation. Otherwise we are deviating from objectivity - the benchmark of the scientific approach.
Well, as I said above, for estimating totals in a finite population, you do not have the option of ignoring missing data or else you really are imputing zeroes, and biasing results downward.
Also, if you ignore missing data, you may have nonignorable nonresponse, where data are not missing at random, and once again, you bias results by over-representing the groups that do respond.
You are respecting the data when you account for missing data by making good use of the relevant data you do have.
And as I said before, if you do not believe in accounting for missing data, then you do not believe in sampling either.
Dora,
You may benefit from the attached paper. It presents different methods for imputing missing data in R.
Conference Paper Comparative Statistical Algorithms for Imputation of Missing...
Hi Dora,
Unless you have sold your soul to the devil, you will have missing data at some point, and when that happens, we have to do something about it, though I wonder too about the merit of replacing missing data...
When you have missing data, one option is to ignore/delete any cases in your sample who have incomplete data (but this reduces your sample size, and could drastically reduce it if lots of people have just left a question or two blank). Another option is to ignore/exclude only those cases who have missing data on the variables in any particular analysis (this means that your sample size can vary by analysis). But the options available for dealing with missing data are many, and each has strengths/weaknesses and is more or less appropriate depending on the amount and pattern of missing data in your sample (e.g. item mean replacement [widely regarded as one of the worst options, but by far the easiest to use], multiple imputation, expectation maximisation, etc.). This book provides a really good basic introduction to handling missing data (and data cleaning in general): https://au.sagepub.com/en-gb/oce/best-practices-in-data-cleaning/book235006
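The two deletion options just described can be sketched with a few toy records (the variable names x, y, z are invented for illustration; None marks a missing item):

```python
# Listwise deletion drops any case with a missing value anywhere;
# pairwise (analysis-by-analysis) deletion only drops cases missing
# on the variables used in that particular analysis.
records = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 4.0, "y": None, "z": 6.0},
    {"x": 7.0, "y": 8.0, "z": None},
    {"x": None, "y": 5.0, "z": 9.0},
]

# Listwise: only fully complete cases survive.
listwise = [r for r in records if all(v is not None for v in r.values())]

# Pairwise: keep cases complete on just the variables in this analysis.
def pairwise(vars_used):
    return [r for r in records if all(r[v] is not None for v in vars_used)]

print(len(listwise))              # 1 complete case left
print(len(pairwise(["x", "y"])))  # 2 cases usable for an x-y analysis
```

This shows why the effective sample size varies by analysis under pairwise deletion, while listwise deletion can shrink the sample badly.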
When you have missing data, you actually do have a smaller sample size, and more for which you have to estimate, at least in estimating a total for continuous data from a finite population, or else, there is that downward bias I noted. That is why I said the intended sample size n is reduced by the number of nonresponses (or edit failures).
You can certainly have "item nonresponse," so that each variable of interest on such a survey actually has a different true sample size. That is a common practice.
For design-based and model-assisted design-based survey sampling and estimation methods, the survey weights can be reweighted to account for missing data if the nonresponse is considered at random within the given stratum - where stratification itself could help if the nonresponse mechanism were otherwise nonignorable. The other option is to impute in one of many ways, which make use of data you have. But those imputed numbers do not 'count,' or act as part of the sample.
One of the best imputation methods would be 'prediction' (an unfortunate choice of word, since forecasting is not meant here), where we could really use prediction for all missing (out-of-sample) data, which is referred to as model-based estimation. But this requires correlated regressor data for the population. Model-based methods can be very helpful, making good use of available regressor data. The regressor data, rather oddly, are called auxiliary data when used for model-assisted design-based sampling and estimation.
Dora -
On the question of ethics:
Missing data can occur in different contexts, but a very important one is estimation of totals or means for continuous survey data, from finite populations, with which I worked extensively, to provide Official Statistics on US energy data, such as total electric sales and generation in various categories. If you do not think it is ethical to estimate for missing data in such a case, then you do not believe it is ethical to sample. Yet the US Bureau of the "Census," and many others, do a great deal of sampling, not only because it is less expensive, but also because it is often more accurate to sample than to stretch resources and obtain a census, with a great deal of measurement error, and nonresponse, and other issues which could cause a great deal of bias and variance. That would be irresponsible. (They also have done a great deal of imputation and/or reweighting.) It is very ethical to do the best job possible. Hiding what is done would be unethical.
Consider then that it would be bad to not estimate for missing data, leaving the results perhaps badly biased, and not take advantage of the great deal of work done by many (highly ethical!) statistical scientists around the world who have worked hard for many years to do the best job possible to estimate for missing data, whether out of sample or nonresponses! To ignore that would not be the responsible thing to do.
To be unethical implies that an estimate for a given missing response - either missing because it was out of sample, or because there was nonresponse or edit failure - is being represented as an actually collected response. Just don't do that. We flagged the data in our files as to which were not actually observed responses. Metadata, such as found in the technical notes section of a data publication, can be used to report data collection methodology, sampling, estimation, imputation, etc. Standard errors of the prediction errors were estimated in methodology that I developed, and overall accuracy was investigated by various people, applied to test data, and other comparisons.
If missing data are ignored, that is what can seriously bias results.
It is highly ethical to estimate for missing data and say how it was done so that results are repeatable. This is not to say to just make something up that suits you. That is not what we are saying.
Thank you - Jim
Dear Dora,
I think it is always scientifically unethical to impute data (replace missing values), because one way or the other the researcher is adding data that were not measured.
But if you really need to:
1. Inspect whether the missings are at random: if there is a relation between missingness on one variable and the value of another variable, then there is a problem. Google 'missing at random', 'missing completely at random', and so on. The accepted amount of randomness in missings varies with the type of analysis.
2. When replacing missings: use a stochastic procedure with variables that do not occur in your analysis.
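The stochastic procedure in point 2 can be sketched as a regression prediction plus a random residual draw, so the imputed values do not artificially reduce spread. The coefficients, residual SD, and x values below are all invented for illustration:

```python
# Stochastic regression imputation sketch: impute the model prediction
# plus noise drawn from the estimated residual distribution, rather
# than the bare prediction.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def stochastic_impute(x, slope, intercept, residual_sd):
    """Regression prediction plus a Gaussian residual draw."""
    return intercept + slope * x + random.gauss(0.0, residual_sd)

# Suppose y is missing for these x values, and a fitted model gave
# y ~ 2 + 0.5*x with residual SD 1.5 (assumed numbers).
missing_x = [3.0, 8.0, 11.0]
imputed_y = [stochastic_impute(x, 0.5, 2.0, 1.5) for x in missing_x]
print(imputed_y)
```

With residual_sd set to 0 this collapses to deterministic regression imputation, which (like mean imputation) understates variability.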
Cheers,
Pierre
Above it says that it is "unethical" to be "...adding data that are not measured," but that is exactly what we do when we estimate from a sample of a finite population to do statistical inference to that population.
............
As one example, the simplest design-based method, simple random sampling, applies the principle of data being missing at random. (The merits of sampling, rather than a census for all official statistics, for example, have been established since the 1940s.)
This same idea for comparisons tracks through to all estimation from sampling, including 'prediction' from model-based methods, where we can look at methods for data not missing at random.
.....
Please, let's not continue to throw the word "unethical" around. It means more in English than some may realize. And I do not believe that being a statistician automatically makes you "unethical." :-)
............................
PS - Whoever downvoted this, I'd like to see them try to explain what they possibly think could be wrong with it. I'd also like to know how they could refute the massive literature on imputation and/or reweighting for missing data. Also, are they going to charge major statistical agencies for official statistics with unethical behavior, and universities and many book and paper authors with fraud? And if they think that no data not observed can be used, they should explain how they expect to census all finite populations, without sampling, without nonresponse, and without massive measurement and other nonsampling error.
Hi Everyone,
wow! Thank you for the papers and all the answers - I haven't been up here for a few days and I was delighted to see all the answers! THANK YOU! I feel more knowledgeable about the matter now :) You see, I just found it really strange that replacing missing values is actually encouraged by (some) publishers.
I have also read that even if the missing values are replaced, the results will not be significantly different than if I had not replaced any missing values and just left them as they are (so no sig. difference between results with replaced values or missing values). Forgive me please, I will try to find the paper that said this and upload it here. But I decided to try it myself as well on my own data (just by creating copies of my dataset) and to see for myself. It may be a very trivial topic - especially for those of you who are a lot more experienced than me - but for me it is really quite interesting.
I think it is ethical provided that in your paper you describe
1) why you needed to replace them,
2) what points exactly were replaced,
3) what was the method used and why you used that specific method (and not the alternatives).
Dear all,
I share the opinion of James Knaub, a very good statistician.
Thank you
Helena
P.S. If you want to deal with missing values in the most unbiased and objective way, you can employ Bayesian estimation, where you treat your missing values as unknown variables with uninformative priors. This will lead to the most objective intervals for your variables of interest; however, it will require more time for the statistical analysis.
Replacing missing values by multiple imputation, when done correctly, is not manipulating data. The procedure is designed to deal with the fact that some of the data are predicted (imputed) instead of observed, and standard errors and degrees of freedom are corrected for this. Therefore inferences are correct. This cannot be achieved with any single imputation procedure. There are two general ways of doing multiple imputation, both Bayesian techniques. The first simulates the multivariate distribution of the data by assuming some kind of probability distribution for the whole data set (e.g., the multivariate normal distribution, as used by Schafer in the NORM software). The second approach is called Fully Conditional Specification (for instance implemented in SPSS), and uses several distributions, depending on the nature of the variables that are imputed.
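The standard-error correction mentioned above is usually done with Rubin's pooling rules: the m completed-data estimates are combined using within- and between-imputation variance. A minimal sketch of the pooling step only (the per-imputation estimates and variances below are invented numbers, not output from any real analysis):

```python
# Rubin's rules for pooling m completed-data analyses:
# pooled estimate = mean of estimates; total variance = within-imputation
# variance + (1 + 1/m) * between-imputation variance.
from statistics import mean, variance

def pool(estimates, variances):
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    w = mean(variances)              # within-imputation variance
    b = variance(estimates)          # between-imputation variance
    total_var = w + (1 + 1 / m) * b  # Rubin's total variance
    return q_bar, total_var

est = [2.1, 2.4, 1.9, 2.2, 2.3]        # estimate from each of m=5 imputed data sets
var = [0.10, 0.12, 0.09, 0.11, 0.10]   # its estimated variance in each data set
q, t = pool(est, var)
print(q, t)
```

The between-imputation component b is what carries the extra uncertainty due to the missing data into the final standard error.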
Dear all,
I agree with James; in my opinion his is the most accurate and logical answer.
When we impute missing data, we are creating new data that we assume to be similar to what would have been observed if everybody had answered. An assumption.
Have a nice day
Helena
PS i really do not understand the downvote...it is science!
If you use an imputation technique that can be cited and has been used before, then I see no issue with imputing your missing data. In your manuscript, just be sure to describe the technique you use (with citation). You should also provide some descriptive statistics on the missing values (number of missing values, do missing values occur more in one group, etc.).
Dear Michael,
You are right... it is an assumption... but what James said is that we are imagining a good replacement using good methods... but what is precision anyway? Whatever the method, replacement means a "gap" that is filled in the best way... with error, a minimum error, which is what we want... of course. Then we are being creative... always trying to find replacements for everything... why do we not maximize what we have? Would that not be more precise?
I think so.
Consider an organization: if you are absent from work (e.g. pregnancy, sickness, etc.), the "replacement of missings" technique would put someone else in your place. Would you find that acceptable? Would it be correct?
Have a nice time
Helena
Also please note the following on variance (as well as the above on bias):
In model-based estimation of category totals for finite populations of continuous data, each missing value, whether out-of-sample or nonresponse, is 'predicted' (estimated, not a forecast) by regression, so single imputation, and the variance of the prediction error for the totals (a little more complex than adding the variance of the prediction errors for the individual missing data cases) can be used to obtain relative standard errors or confidence intervals for those estimated totals. So although the predicted values are on the regression line, we still estimate variance (of the prediction error).
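For a single regression-imputed value, the prediction-error variance Jim mentions (what STDI reports in SAS PROC REG) can be sketched for simple linear regression with toy data (the x and y values below are invented):

```python
# Under the usual OLS assumptions, the variance of the prediction error
# for a new observation at x0 is s^2 * (1 + 1/n + (x0 - xbar)^2 / Sxx),
# and STDI is its square root.
from math import sqrt
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # invented data, roughly y = 2x

n = len(x)
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar

# Residual variance estimate with n - 2 degrees of freedom.
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r * r for r in resid) / (n - 2)

x0 = 3.5                          # x-value of the missing case
pred = intercept + slope * x0     # single (regression) imputation
stdi = sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))
print(pred, stdi)
```

So even though the imputed value sits exactly on the regression line, we still attach an estimated prediction-error variance to it, which is the point of the post above.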
The proposition that there are real values and made-up values is itself a fallacy. True-value theory tells us that all representations of data, even those which are "objectively" observed, are a combination of three possible components: the true (perhaps unknowable) value of what's being investigated, some sort of systematic error (perhaps not present if proper controls have been put in effect), and random error.
So, when we think about imputation or other methods for "making-up" missing data, what we're talking about is not real data vs. fake data, but the degree of error in the measurement. It's theoretically possible that the imputed value is actually closer to the "one, true, unknowable" value than the actual observed value, if the observation was sloppy or faulty, the transcription is in error (etc.) - it's a matter of the direction of the respective errors. The imputed value may be wrong in a direction that brings it close to the actual value inherent in what's being evaluated - no way to tell. Perhaps the two errors would cancel each other out.
Truly, all a statistician/researcher can do is their best to minimize systematic error through controls, and (this is the hard part) think about the methods they will use to construct missing data - if they use the same method every time, they're likely doing it wrong. Some methods depress variance while increasing n - this changes the relationship upon which significance is based; others increase variance while increasing n.
The best reason to "make up" data is to stabilize power - one should attempt to do this without unduly impacting alpha.
Also keep in mind that we are not so much interested in the exact value to impute. An imputation model does not represent causal relationships or the exact but unknown answer to a specific item. Rather, the model is a device to preserve important features (relations) of the observed data in the completed (imputed) data.
Besides the ethical issues, missing data can be handled with a Bayesian approach, regression analysis, or the EM algorithm.
Dora,
The practice of imputation for missing data is widely-accepted so long as the methodology being employed is sound. The imputation function built into SPSS is an acceptable form to use. As a statistical reviewer for journals, I have no problem with imputation methods being applied to the data so long as the method itself is clearly and succinctly explained and that it is referenced with either a textbook or a peer-reviewed journal article. Usually this explanation is provided in a Statistical Analysis subsection of the Methods section.
Mike
Hi Dora,
You may look at the Missing Data Imputation algorithms provided in the following two papers. These algorithms were implemented in R. I can share with you the codes if you think they are useful to you.
Best,
Watheq
Conference Paper Comparative Statistical Algorithms for Imputation of Missing...
Conference Paper Efficient Imputation of Incomplete Petrophysical Dataset thr...
I have a follow-on question regarding missing data. If you have used a missing data remedy, e.g. Maximum Likelihood Estimation, on all items within a scale, is it advisable/preferred to calculate scale reliability after applying the missing data remedy, or before (on the raw data)?
If you are referring to internal consistency of a scale (i.e. Cronbach's alpha), I've always done this after dealing with missing data because internal consistency is tested at the scale/questionnaire level whereas missing data is dealt with at the item level/individual question. I don't know what is considered preferable, but I've never tested the internal consistency of a scale using the items with missing data.
Thanks Nicola, I really appreciate your response - it has validated my thinking. It made sense to me that, just like you would deal with outliers first, if you decide to apply a missing data remedy, the internal consistency of a scale should be calculated post, not pre. It is my first time using the MLE method for missing data, so I really appreciate you taking the time to answer. Thanks so much.
If accuracy really matters, I highly recommend this JMLR 2007 paper:
Handling Missing Values when Applying Classification Models http://jmlr.csail.mit.edu/papers/volume8/saar-tsechansky07a/saar-tsechansky07a.pdf
"Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin."
I know I'm a little late to this thread, but here is my experience with handling missing values:
1. Remove them.
2. Replace NAs or NULLs with the mean, median, or mode.
3. Replace them with zeros.
4. Imputation methods using algorithms such as PAM, regression, and classification techniques based on distributions, etc. There is a whole package in R dedicated to this called mice.
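Options 2 and 3 from the list above can be sketched in a few lines (toy data; as noted earlier in the thread, these single-value fills shrink variance and can bias results):

```python
# Simple single-value replacements: fill missing entries (None) with
# the observed mean, median, mode, or zero.
from statistics import mean, median, mode

data = [3.0, 5.0, 5.0, 9.0, None, None]
obs = [v for v in data if v is not None]   # observed values only

def fill(values, replacement):
    return [replacement if v is None else v for v in values]

print(fill(data, mean(obs)))    # mean fill: 5.5
print(fill(data, median(obs)))  # median fill: 5.0
print(fill(data, mode(obs)))    # mode fill: 5.0
print(fill(data, 0.0))          # zero fill
```

Option 4 (model-based imputation, e.g. via the mice package in R) is more work but avoids the variance-shrinking problem of these single-value fills.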
I did a study on imputation and how it affects the results. Please see the attachment for my results.