I have found imputation methods based on statistical analysis and imputation methods based on machine learning. However, I do not know whether I can safely use them to impute NaNs in non-Gaussian datasets. Is there any research in this area?
First, try to verify whether the data points are missing "completely at random", that is, independently of the value. If, for a variable, specific values are more likely to be missing than others, this needs to be taken into account. When missingness can depend on the values of other variables, it becomes even harder.
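One informal way to probe this in Python is to test whether missingness in one column is associated with the observed values of another; a sketch under the assumption that the data sit in a pandas DataFrame (the column names and the injected missingness pattern here are made up for illustration, and note this can only flag a violation, it cannot prove missingness is completely at random):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical example data: 'x' has missing values, 'y' is fully observed.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500), "y": rng.normal(size=500)})
df.loc[df["y"] > 1.0, "x"] = np.nan  # missingness depends on y (not MCAR)

miss = df["x"].isna()
# Compare observed 'y' between rows where 'x' is missing vs present.
# A significant difference suggests missingness is NOT completely at random.
t, p = stats.ttest_ind(df.loc[miss, "y"], df.loc[~miss, "y"], equal_var=False)
print(f"Welch t-test on y by missingness of x: t={t:.2f}, p={p:.4f}")
```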
For now, I'll assume they are missing completely at random.
Any prediction algorithm for the variable (classification for discrete variables, regression for numerical values) can be used for imputation. Try using a number of different predictors to obtain robustness: for example, use five different algorithms and combine their outputs by majority voting (for classification) or by averaging (for regression).
There are many algorithms out there, and some might be better suited for your data set than others.
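As a minimal sketch of that ensemble idea in Python with scikit-learn, assuming a numeric target column to impute and fully observed predictor columns (the column names and model choices here are illustrative, not prescriptive):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def ensemble_impute(df, target, predictors):
    """Fill NaNs in `target` with the mean prediction of several regressors."""
    observed = df[target].notna()
    X_train, y_train = df.loc[observed, predictors], df.loc[observed, target]
    X_miss = df.loc[~observed, predictors]
    models = [LinearRegression(),
              KNeighborsRegressor(n_neighbors=5),
              RandomForestRegressor(n_estimators=100, random_state=0)]
    preds = [m.fit(X_train, y_train).predict(X_miss) for m in models]
    out = df.copy()
    out.loc[~observed, target] = np.mean(preds, axis=0)  # average across models
    return out

# Hypothetical usage (column names are made up):
# df_imputed = ensemble_impute(df, "income", ["age", "hours"])
```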
Regarding missing data, there are varying 'degrees' of 'missingness,' one of which Robby noted as MCAR (missing completely at random). In the context of survey nonresponse, this is considered "ignorable nonresponse." That does not mean it is really ignored, as explained below.
I worked in establishment surveys (finite populations) using continuous data, and had to look up "NaN." Is that a division by zero in a program because a missing number is incorrectly interpreted by the software as a zero? That sounds like it could be another problem with which I am familiar: having a difficult time determining whether you have a nonresponse or a response of zero. However, note that my experience is with estimation for finite populations using continuous data. You will need to decide what, if anything, of what I write here applies to your project.
If your question is one of continuous data missing randomly from a non-Gaussian distribution, then I think multiple imputation could help you get a good picture of what is happening. For 'single imputation,' substituting a mean for each missing datum may be OK if either there are enough cases for the errors to somewhat cancel each other, or you do not have to impute enough values to really matter. But imputing mean values will artificially lower any variance estimates if you include those imputed values. There has been work done to try to account for that; there was at least one interesting paper on it at the International Conference on Survey Nonresponse in 1999.
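A toy sketch of both points in Python, assuming scikit-learn is available (the skewed data and missingness here are simulated for illustration): mean imputation shrinks the variance estimate, while repeated stochastic imputations, here via IterativeImputer with sample_posterior=True, preserve more of the spread and expose the between-imputation uncertainty:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(400, 2))  # skewed, non-Gaussian
x_miss = x.copy()
x_miss[rng.random(400) < 0.3, 0] = np.nan  # 30% missing in column 0

# Single mean imputation: the variance of column 0 is biased downward.
col = x_miss[:, 0]
mean_filled = np.where(np.isnan(col), np.nanmean(col), col)
print("true var:", x[:, 0].var(), " mean-imputed var:", mean_filled.var())

# Multiple imputation: several stochastic fills give a spread of estimates.
vars_ = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    vars_.append(imp.fit_transform(x_miss)[:, 0].var())
print("multiply-imputed vars:", np.round(vars_, 3))
```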
If you have regressor data related to the variable of interest, you can impute by "prediction" and estimate the "variance of the prediction error" to know something about the uncertainty introduced by these imputations. Estimating the variance of the prediction error is an econometrics technique for each individual missing datum, but it can be applied to estimated totals with a bit of manipulation of the formula. For an individual datum, the square root of the estimated variance of the prediction error in SAS PROC REG, for example, is STDI.
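A rough Python analogue of that, sketched with statsmodels in place of SAS PROC REG (the data are simulated; the standard error of the individual prediction is computed from the standard error of the mean prediction plus the residual variance):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Predict at two hypothetical x values where y is "missing".
x_new = sm.add_constant(np.array([0.3, -1.2]), has_constant="add")
pred = res.get_prediction(x_new)
# Standard error of the prediction error for an individual datum
# (analogous to STDI in SAS PROC REG): se of the mean plus residual variance.
se_pred = np.sqrt(pred.se_mean**2 + res.mse_resid)
print("imputed values:", pred.predicted_mean, " se of prediction:", se_pred)
```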
Your best method of imputation depends on what your overall project is about and the kinds of data. Below I have edited in some remarks I used to answer another question on missing data.
There are two major types of nonresponse: "ignorable nonresponse" and "nonignorable nonresponse," ideas which would apply to any missing data, I suppose. The first just means that there is no special reason for the nonresponse, so it is as if a random selection is missing; the latter means those missing data are, in a way, 'unrepresented' by the data collected. I am only familiar with continuous data, but there are several methods to try to account for the missing data by imputation ("hot deck," "nearest neighbor," etc.), and then there is (re)weighting of the data if you are trying to produce an estimate for a finite population.

To impute a mean would usually mean you are expecting that the data were missing at random, which would be true for ignorable nonresponse, but you could group your data by the reason for the missing data. Often these reasons will actually give you different response rates for different groups of data that you may identify, called "response propensity groups." If you group your data that way, and impute or reweight within groups, then you could obtain more 'representative' results.

But whenever possible, I think regression (so-called "prediction") is very useful. It means that you can use data you already have, and the relationships between these data, to good advantage. Note that the individual standard errors for the prediction errors will somewhat 'cancel,' to a degree, so that the variance of the prediction error for an estimated total may be much better than one might expect.
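As a minimal pandas sketch of imputing within groups (the group labels and column names are made up; in practice the groups might be response-propensity classes you have constructed from response rates or reasons for nonresponse):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=300),  # e.g. propensity classes
    "value": rng.lognormal(sigma=1.0, size=300),     # skewed survey variable
})
df.loc[rng.random(300) < 0.2, "value"] = np.nan      # 20% nonresponse

# Impute each missing value with the mean of its own group, so groups with
# different response rates (and different levels) stay 'representative'.
df["value_imputed"] = df.groupby("group")["value"].transform(
    lambda s: s.fillna(s.mean())
)
```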
So, for your continuous data at least, some of the terminology above might, if you search the Internet, give you some ideas as to the best way(s) for you to handle your missing data.
At any rate, I think that the fact that your data do not follow a Gaussian distribution is not the biggest problem. It could be a problem if, for example, a missing value would have fallen in the long tail of a skewed distribution had it been observed, and you did not know it. However, if you are able to group your data in some way, based on characteristics of the data or perhaps response propensity, then this should help deal with just about any kind of problem you might have, including skewed data.
Thank you Sebastian. I looked up "NaN," and saw "division by zero," which made no sense to me in the context of Razieh's question. What you said works for me.
I'd only worked with SAS, SPSS, BMDP, FORTRAN, JMP, and Excel, if that counts, and had never seen "NaN" before (unless I forgot it from limited, long-ago use of SPSS and BMDP).
BTW, you may find it 'interesting' that in the official statistics/estimations I worked on for establishment surveys, when some survey designs and/or data collection systems made it difficult to tell a missing number from a reported zero, that was a big problem. A missing number might have been very large, if collected. I think people know that, but then get lost in the details and forget, or get confused by other factors. One office even told me that they did not make imputations for missing numbers, apparently not realizing that yes, they were imputing zeroes, and thus biasing every reported total downward. Seems hard to believe, but it happens. I guess that was a corollary to the maxim that "making no decision is a decision" ... by default.