I have found imputation methods based on statistical analysis and imputation methods based on machine learning. However, I do not know whether I can safely use them to impute NaNs in non-Gaussian datasets. Is there any research in this area?
First, try to verify whether the data points are missing "completely at random", that is, independently of the value. If, for a variable, specific values are more likely to be missing than others, this needs to be taken into account. When missingness can depend on the values of other variables, it becomes even harder.
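One informal way to probe this in Python is to test whether missingness in one column is associated with the observed values of another; a sketch under the assumption that the data sit in a pandas DataFrame (the column names and the injected missingness pattern here are made up for illustration, and note this can only flag a violation, it cannot prove missingness is completely at random):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical example data: 'x' has missing values, 'y' is fully observed.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500), "y": rng.normal(size=500)})
df.loc[df["y"] > 1.0, "x"] = np.nan  # missingness depends on y (not MCAR)

miss = df["x"].isna()
# Compare observed 'y' between rows where 'x' is missing vs present.
# A significant difference suggests missingness is NOT completely at random.
t, p = stats.ttest_ind(df.loc[miss, "y"], df.loc[~miss, "y"], equal_var=False)
print(f"Welch t-test on y by missingness of x: t={t:.2f}, p={p:.4f}")
```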
For now, I'll assume they are missing completely at random.
Any prediction algorithm for the variable (classification for discrete variables, regression for numerical values) can be used for imputation. Try using a number of different predictors to obtain robustness: for example, use five different algorithms and combine their outputs by majority voting (for classification) or by averaging (for regression).
There are many algorithms out there, and some might be better suited for your data set than others.
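As a minimal sketch of that ensemble idea in Python with scikit-learn, assuming a numeric target column to impute and fully observed predictor columns (the column names and model choices here are illustrative, not prescriptive):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def ensemble_impute(df, target, predictors):
    """Fill NaNs in `target` with the mean prediction of several regressors."""
    observed = df[target].notna()
    X_train, y_train = df.loc[observed, predictors], df.loc[observed, target]
    X_miss = df.loc[~observed, predictors]
    models = [LinearRegression(),
              KNeighborsRegressor(n_neighbors=5),
              RandomForestRegressor(n_estimators=100, random_state=0)]
    preds = [m.fit(X_train, y_train).predict(X_miss) for m in models]
    out = df.copy()
    out.loc[~observed, target] = np.mean(preds, axis=0)  # average across models
    return out

# Hypothetical usage (column names are made up):
# df_imputed = ensemble_impute(df, "income", ["age", "hours"])
```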
Regarding missing data, there are varying 'degrees' of 'missingness,' one of which Robby noted as MCAR (missing completely at random). In the context of survey nonresponse, this is considered "ignorable nonresponse." That does not mean it is really ignored, as explained below.
I worked in establishment surveys (finite populations) using continuous data, and had to look up "NaN." Is that a division by zero in a program because a missing number is incorrectly interpreted by the software as a zero? That sounds like it could be another problem with which I am familiar: having a difficult time determining whether you have a nonresponse or a response of zero. However, note that my experience is with estimation for finite populations using continuous data. You will need to decide what, if anything, of what I write here applies to your project.
If your question is one of continuous data missing randomly from a non-Gaussian distribution, then I think multiple imputation could help you get a good picture of what is happening. For 'single imputation,' substituting a mean for each missing datum may be OK if either there are enough cases for the errors to somewhat cancel each other, or you do not have to impute enough values to really matter. But imputing mean values will artificially lower any variance estimates if you include those imputed values. There has been work done to try to account for that; there was at least one interesting paper on it at the International Conference on Survey Nonresponse in 1999.
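A toy sketch of both points in Python, assuming scikit-learn is available (the skewed data and missingness here are simulated for illustration): mean imputation shrinks the variance estimate, while repeated stochastic imputations, here via IterativeImputer with sample_posterior=True, preserve more of the spread and expose the between-imputation uncertainty:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(400, 2))  # skewed, non-Gaussian
x_miss = x.copy()
x_miss[rng.random(400) < 0.3, 0] = np.nan  # 30% missing in column 0

# Single mean imputation: the variance of column 0 is biased downward.
col = x_miss[:, 0]
mean_filled = np.where(np.isnan(col), np.nanmean(col), col)
print("true var:", x[:, 0].var(), " mean-imputed var:", mean_filled.var())

# Multiple imputation: several stochastic fills give a spread of estimates.
vars_ = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    vars_.append(imp.fit_transform(x_miss)[:, 0].var())
print("multiply-imputed vars:", np.round(vars_, 3))
```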
If you have regressor data related to the variable of interest, you can impute by "prediction" and estimate the "variance of the prediction error" to know something about the uncertainty introduced by these imputations. Estimating the variance of the prediction error is an econometrics technique for each individual missing datum, but it can be applied to estimated totals with a bit of manipulation of the formula. For an individual datum, the square root of the estimated variance of the prediction error in SAS PROC REG, for example, is STDI.
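A rough Python analogue of that, sketched with statsmodels in place of SAS PROC REG (the data are simulated; the standard error of the individual prediction is computed from the standard error of the mean prediction plus the residual variance):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Predict at two hypothetical x values where y is "missing".
x_new = sm.add_constant(np.array([0.3, -1.2]), has_constant="add")
pred = res.get_prediction(x_new)
# Standard error of the prediction error for an individual datum
# (analogous to STDI in SAS PROC REG): se of the mean plus residual variance.
se_pred = np.sqrt(pred.se_mean**2 + res.mse_resid)
print("imputed values:", pred.predicted_mean, " se of prediction:", se_pred)
```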
Your best method of imputation depends on what your overall project is about and the kinds of data. Below I have edited in some remarks I used to answer another question on missing data.
There are two major types of nonresponse: "ignorable nonresponse" and "nonignorable nonresponse," ideas which would apply to any missing data, I suppose. The first just means that there is no special reason for the nonresponse, so it is as if a random selection is missing; the latter means those missing data are, in a way, 'unrepresented' by the data collected. I am only familiar with continuous data, but there are several methods to try to account for the missing data by imputation ("hot deck," "nearest neighbor," etc.), and then there is (re)weighting of the data if you are trying to produce an estimate for a finite population.

To impute a mean would usually mean you are expecting that the data were missing at random, which would be true for ignorable nonresponse, but you could group your data by the reason for the missing data. Often these reasons will actually give you different response rates for different groups of data that you may identify, called "response propensity groups." If you group your data that way, and impute or reweight within groups, then you could obtain more 'representative' results.

But whenever possible, I think regression (so-called "prediction") is very useful. It means that you can use data you already have, and the relationships between these data, to good advantage. Note that the individual standard errors for the prediction errors will somewhat 'cancel,' to a degree, so that the variance of the prediction error for an estimated total may be much better than one might expect.
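As a minimal pandas sketch of imputing within groups (the group labels and column names are made up; in practice the groups might be response-propensity classes you have constructed from response rates or reasons for nonresponse):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=300),  # e.g. propensity classes
    "value": rng.lognormal(sigma=1.0, size=300),     # skewed survey variable
})
df.loc[rng.random(300) < 0.2, "value"] = np.nan      # 20% nonresponse

# Impute each missing value with the mean of its own group, so groups with
# different response rates (and different levels) stay 'representative'.
df["value_imputed"] = df.groupby("group")["value"].transform(
    lambda s: s.fillna(s.mean())
)
```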
So, for your continuous data at least, some of the terminology above might, if you search the Internet, give you some ideas as to the best way(s) for you to handle your missing data.
At any rate, I think that the fact that your data do not follow a Gaussian distribution is not the biggest problem. It could be a problem if, for example, a missing value would have fallen in the long tail of a skewed distribution had it been observed, and you did not know it. However, if you are able to group your data in some way, based on characteristics of the data or perhaps response propensity, then this should help deal with just about any kind of problem you might have, including skewed data.
Thank you Sebastian. I looked up "NaN," and saw "division by zero," which made no sense to me in the context of Razieh's question. What you said works for me.
I'd only worked with SAS, SPSS, BMDP, FORTRAN, JMP, and Excel, if that counts, and had never seen "NaN" before (unless I forgot it from limited, long-ago use of SPSS and BMDP).
BTW, you may find it 'interesting' that in the official statistics/estimations I worked on for establishment surveys, when some survey designs and/or data collection systems made it difficult to tell a missing number from a reported zero, that was a big problem. A missing number might have been very large, if collected. I think people know that, but then get lost in the details and forget, or get confused by other factors. One office even told me that they did not make imputations for missing numbers, apparently not realizing that yes, they were imputing zeroes, and thus biasing every reported total downward. Seems hard to believe, but it happens. I guess that was a corollary to the maxim that "making no decision is a decision" ... by default.