Many practical problems come with missing data in their datasets, and the missing entries are sometimes essential for solving the problem, so they cannot simply be ignored. A naive way to deal with missing values is to fill them with a constant or with the mean of the sample's class. A more precise way is to predict them, for example with regression or classification. Note also that a missing value does not mean the data are wrong: on an application form, applicants are often asked to state their occupation, and if an applicant is jobless, the occupation field will simply be a missing value.
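As a small illustration of the two naive strategies mentioned above (constant fill and per-class mean fill), here is a minimal pandas sketch; the column names "class" and "income" are made up for the example.

```python
# Minimal sketch of the naive fill strategies mentioned above (pandas);
# the columns "class" and "income" are hypothetical.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class":  ["a", "a", "b", "b", "b"],
    "income": [35.0, np.nan, 51.0, np.nan, 47.0],
})

# 1) fill with a constant
df["income_const"] = df["income"].fillna(0.0)

# 2) fill with the mean of the sample's own class
df["income_classmean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```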
Joachim's answer is correct. Moreover, for a time-series problem you may use a Hidden Markov Model to predict the missing values. All the data, including the predicted missing values, can then be used to train the neural network in the next step.
Joachim's answer about letting the neural network predict the missing values is the most interesting answer here.
The way that would work is that you essentially run the neural network in reverse. This is a straightforward optimization problem, similar to the way networks can be interrogated for the most likely inputs for particular outputs, except that here many of the inputs are constrained. When finding plausible inputs from scratch the problem is highly multi-modal, but when estimating the values of only a few inputs it should be better behaved.
The same technique can be used when applying the network to novel data, except that now the output is also unconstrained.
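A minimal PyTorch sketch of this "run the network in reverse" idea, assuming an already trained network: the weights stay frozen and only the missing input entries are optimized so that the output matches the known target. The function name, the placeholder `net`, and the hyperparameters are illustrative assumptions, not anyone's published code.

```python
# Sketch: estimate missing inputs by gradient descent through a frozen, trained network.
import torch

def fill_missing_inputs(net, x_known, missing_idx, y_target, steps=500, lr=0.05):
    # x_known: (n_features,) tensor with placeholder values at the missing positions
    # missing_idx: list of missing positions; y_target: the known output for this sample
    missing = torch.zeros(len(missing_idx), requires_grad=True)
    opt = torch.optim.Adam([missing], lr=lr)
    for _ in range(steps):
        x = x_known.clone()
        x[missing_idx] = missing                 # splice trial values into the input
        loss = torch.nn.functional.mse_loss(net(x.unsqueeze(0)).squeeze(0), y_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return missing.detach()                      # plausible values for the missing inputs
```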
Maybe you can use the last known value from the series. (In stock trading, for example, if you collect close prices for every minute, there may have been no trade in a particular minute, so no close price is available. In that case you can use the last known close price instead.)
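This "last known value" strategy is just a forward fill; a minimal pandas sketch with an illustrative minute-bar series:

```python
# Forward fill for a minute-bar close-price series with missing minutes (values are illustrative).
import pandas as pd
import numpy as np

close = pd.Series(
    [101.2, np.nan, np.nan, 101.5, 101.4],
    index=pd.date_range("2024-01-02 09:30", periods=5, freq="min"),
)
filled = close.ffill()   # each missing minute takes the last traded close
```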
Theoretically, missing data is a loss of information. However, it depends on the data itself. For example, even if some data are missing, if the hypotheses of the Nyquist-Shannon sampling theorem are met, we have no problem training the net. Conversely, if you have to learn a hidden y = f(x) and you only have samples inside the intervals [0,1] and [2,3], it is possible (depending on f(x)) that the true values of f(x) inside [1,2] simply cannot be derived from the available samples.
The truth is that the question you raised cannot be answered without more information about the dataset. Perhaps only a careful study of your available data can give some suggestions.
Don't expect that splitting the data into training, validation and test sets will resolve your problem. In the second case I described, I could get very good errors on all three sets, but the problem with the points in [1,2] would remain.
A simple feedforward network can't really be used without additional assumptions.
Such a network can be augmented with additional information to allow missing input elements by viewing it as the mode of a posterior estimate of the joint distribution of the inputs and target. A few assumptions of multivariate normality and independence structure later, you can derive the conditional distribution of the missing input.
Without the assumption of normality, you can assume that the network is giving you a value that is proportional to the conditional probability of the target variable given the predictor variables. The normalization constant is not known, but it is constant.
Given this, if you have the target variable, then you can also simply manipulate the missing input values to maximize the score for the known target.
If both target and some predictor variables are missing, then you can manipulate the missing predictor variables to give the highest value of the maximum output score for any option. That becomes your prediction.
Another option, if you have both a missing predictor and a missing target, is for the output to be not a single prediction of the target but a distribution of the target conditional on the actual value of the missing predictor. To get this, you sample values of the missing predictor, possibly biased toward distinctive target results, and use the sampled values of the missing predictor and the resulting targets as training data for a new network, which is returned as the desired result.
For training with missing predictor values, you can simply use back propagation to derive values for the missing predictors in the same way that you use back propagation to define weights. As such, you can view the topology of the network as variable ... when a missing value is encountered, the topology is changed to have a structure with a constant and a one-off weight. That weight is actually the desired missing value.
Since the missing values will be trained rarely in any gradient descent, it is often desirable to provide additional epochs specifically for training the missing values themselves.
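A minimal PyTorch sketch of the scheme described above: each missing entry becomes a trainable parameter (the "one-off weight") updated by the same backpropagation that trains the network, followed by a few extra epochs that update only those parameters. The toy data, model and hyperparameters are assumptions made for illustration.

```python
# Sketch: missing predictor entries as trainable parameters, learned jointly with the weights.
import torch
import torch.nn as nn

X = torch.randn(200, 5)                      # toy data; some entries are "missing"
y = X.sum(dim=1, keepdim=True)
miss_mask = torch.rand_like(X) < 0.1         # True where a value is missing
X = X.masked_fill(miss_mask, 0.0)

net = nn.Sequential(nn.Linear(5, 16), nn.Tanh(), nn.Linear(16, 1))
missing_vals = nn.Parameter(torch.zeros(int(miss_mask.sum())))   # one "weight" per missing cell

def completed(X, miss_mask, missing_vals):
    Xc = X.clone()
    Xc[miss_mask] = missing_vals             # splice the learned values into the inputs
    return Xc

# joint training of weights and missing values
opt = torch.optim.Adam(list(net.parameters()) + [missing_vals], lr=1e-2)
for epoch in range(200):
    loss = nn.functional.mse_loss(net(completed(X, miss_mask, missing_vals)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# extra epochs that update only the missing values, as suggested above
opt_missing = torch.optim.Adam([missing_vals], lr=1e-2)
for epoch in range(100):
    loss = nn.functional.mse_loss(net(completed(X, miss_mask, missing_vals)), y)
    opt_missing.zero_grad()
    loss.backward()
    opt_missing.step()
```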
I think you are being quite unnecessarily pessimistic about these approaches.
Regarding the Gaussian treatment of the posterior of a neural network, see http://www.inference.phy.cam.ac.uk/mackay/BayesNets.html
Regarding backpropagation, yes, the network will need some training on other inputs to be useful for doing backpropagation to missing inputs. That this works well when you have mostly non-missing values is hardly controversial ... indeed it is completely standard practice. Refer to all of the cases of image completion in the literature or the use of optimization to find most plausible inputs to explain various output cells (see http://research.google.com/archive/unsupervised_icml2012.html ) for several examples. These work by assuming that *all* inputs are missing so the technique clearly works.
Only the Bayesian approach is dependent on strong assumptions. The other approaches only assume that the non-missing data is strong enough to mostly train the network. This isn't so far from the assumptions that underlie the use of neural networks in the first place.
Finally, the idea of returning a function as the result given partial inputs is hardly novel. It is simply a neural approach to function currying which is very well known ( https://en.wikipedia.org/wiki/Currying ). The idea of using one model to provide training data for another is also hardly novel. This is often done, for instance, to derive a simpler model using an unregularized training algorithm from a more complex model which was trained using good regularization. Thus, a neural network can be trained using simple techniques to mimic the behavior of a random forest. By using the random forest to synthesize a very large amount of training data, the putative lack of regularization in the training for the neural network is moot ... it doesn't learn to over-fit the original data; it learns to fit the regularized and desirable behavior of the random forest.
Again, none of these are at all revolutionary. At most they might be unfamiliar to some which is precisely why I mentioned them in the first place.
Time-series missing data imputation is an active field of research, and there are many possible strategies for preprocessing the data efficiently. First of all, I tend to suggest rolling means for locally imputing single missing values. Another possible strategy is to impute the data using interpolation; in this way you can impute more than 2-3 consecutive values by following a trend that fits the data well. Likewise, you can try to estimate the cycle of your data and then impute the missing values (usually more than 2-3 values, but reasonably not many more). Another possible approach is to start from the existing observations and model the data with different methods (ARIMA, seasonal ARIMA and so on), and then predict the missing values from the fitted model. It is important to be careful about structural changes, so one idea is to impute data using rolling windows. Finally, if you use your neural network for forecasting, you can combine the predictions obtained with different imputation methods (the simplest way is to average them). In this way you minimize the impact of an imputation method that turns out to be inappropriate.
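Two of the simpler strategies above (local rolling-mean fill and interpolation) can be sketched in a few lines of pandas; the series below is purely illustrative.

```python
# Sketch: rolling-mean fill and interpolation for a short time series.
import pandas as pd
import numpy as np

s = pd.Series([10.0, 11.0, np.nan, 12.5, np.nan, np.nan, 14.0])

# rolling mean over a small centred window, used only where values are missing
rolling = s.rolling(window=3, center=True, min_periods=1).mean()
filled_rolling = s.fillna(rolling)

# linear interpolation for short gaps (limit caps how many consecutive values are filled)
filled_interp = s.interpolate(method="linear", limit=3)
```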
1/ You draw a value from your dataset. The advantage over replacing by the mean value is that you draw the value from the estimated distribution of the values.
2/ You search for the k nearest neighbours using all dimensions except the missing one, and you draw a value from those k nearest neighbours. The advantage over the previous method is that you take the correlations between dimensions into account. The drawback is the higher computational cost. Note that efficient algorithms exist to approximate the k-NN search. See for instance:
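For the second option, scikit-learn's KNNImputer is a convenient starting point; note it averages the k neighbours rather than drawing one at random as suggested above. The array below is illustrative.

```python
# Sketch of k-NN based imputation: neighbours are found on the non-missing dimensions.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 1.9, 3.2],
    [5.0, 6.1, 7.8],
    [4.9, np.nan, 8.0],
])

imputer = KNNImputer(n_neighbors=2)   # distances ignore the missing coordinates
X_filled = imputer.fit_transform(X)   # missing cells replaced by the neighbours' mean
```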
I agree with the others: ignoring the missing values during data analysis is not a good idea, even if a neural network can do it. In time-series prediction problems, ignoring the missing values is an extremely bad idea, so I fully agree with Dr. Joachim Pimiskern's way of resolving your problem: let the neural net predict the missing values and feed them back.
How can I trust my interpolated data / the data predicted by my net? I am tackling a system in which a particular output is sampled irregularly; sometimes I have missing data for nearly 100 samples, so I don't even have continuous data to train my network on. I interpolated the time series using an in-painting technique (normally used for filling images). Can I trust my interpolated data?
Whatever kind of data you introduce in place of the lost data (see the suggestions the other colleagues have given you), remember that it will deeply and dramatically influence the ANN's forward behaviour. So take care with it.
You can use a dataset with missing data, but you need to know that predictions in that range will not be reliable. Whether you can interpolate the data for the missing range depends on the application.
For the sake of testing the various solutions suggested above, I want to create a time-series dataset with missing values, either at random positions or in intervals. Could you please point me to some works in the literature that explain how to create datasets of this type in the field of machine learning?
The result will depend on cross-dependency between missing and available elements of input vectors and/or the importance of missing ones.
If the inputs are dependent, there is redundancy of information, so the missing parts can be deduced in many ways. In such a case, several of the techniques proposed above will give quite good results; e.g. I like the one with triggers/valves for each input ;). Whereas if the inputs are independent, any trick applied to deduce the missing ones will fail.
If the missing parts have a strong impact on the neural network output and they are not deduced properly, then the errors will be huge. If the missing parts have only a minor influence, then it will be OK regardless of the efficiency of the applied method of deduction.
As to these issues, it might be tricky to formulate any general conclusions or to estimate the efficiency of the applied methods from an experiment performed on specific data.
Sensitivity analysis applied to the neural network can be used to estimate the importance of a particular input.
It might also be helpful to check the cross-dependency, e.g. by training some neural networks to predict one input from the others (somewhat suggested above, but I mean an independent experiment performed before the real training), or, again, by studying the mixed derivatives in the sensitivity analysis.
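A simple gradient-based form of the sensitivity analysis mentioned above can be sketched in a few lines of PyTorch, assuming `net` is a trained model and `X` a batch of inputs; the mean absolute gradient of the output with respect to each input dimension serves as a rough importance score.

```python
# Sketch: gradient-based sensitivity of the output to each input dimension.
import torch

def input_sensitivity(net, X):
    X = X.clone().requires_grad_(True)
    out = net(X).sum()                 # scalar, so a single backward pass suffices
    out.backward()
    return X.grad.abs().mean(dim=0)    # one importance score per input dimension
```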
Missing data may have different characteristics than the existing data. Training with the missing data neglects this, so the rows with missing data should be discarded; otherwise they will increase the impact of overfitting and underfitting problems, because of the effectively arbitrary values assigned to those rows.
There are several packages in R (like mice) that can impute your missing data. You can use them to impute the missing values and then train the neural network.
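mice does multivariate imputation by chained equations; a rough Python analogue, if you prefer to stay in one language, is scikit-learn's (still experimental) IterativeImputer. A minimal sketch with an illustrative array:

```python
# Sketch: chained-equations-style imputation with scikit-learn's IterativeImputer.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required opt-in)
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0]])

X_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
# X_imputed can then be used to train the neural network as usual
```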
I believe that you first need to get rid of the missing data using one of these techniques, because otherwise it is meaningless to multiply a NaN input by a weight, add a bias, do backpropagation, etc.
The output layer will contain NaNs and you will end up with a wrong classification. So a net trained WITH missing data is not valid.
Consider a transformer framework for tabular data (e.g. TabNet): mask the missing values and do an MLM-style task to predict them.
A similar idea is Cui and Fen (2021): they use a transformer trained on the numeric data with missing values, and at inference time the model predicts the missing values. Their experiments show a better RMSE than hot-deck imputation or kNN.
Cui, Zhengyuan, and Cameron Fen. 2021. “Using Transformers to Impute Missing Data.”
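For illustration only (this is neither TabNet nor the architecture from the cited paper), a minimal masked-reconstruction sketch along those lines: missing cells are replaced by a learned mask token, and a small transformer encoder is trained to reconstruct held-out observed values. All names, sizes and the toy data are assumptions.

```python
# Sketch: masked-value imputation for numeric tabular data with a small transformer encoder.
import torch
import torch.nn as nn

class MaskedTabularImputer(nn.Module):
    def __init__(self, n_features, d_model=32, n_heads=4, n_layers=2):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)            # embed each numeric cell
        self.col_embed = nn.Embedding(n_features, d_model)  # which column the cell comes from
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                   # reconstruct the cell value

    def forward(self, x, masked):                            # x: (B, F), masked: bool (B, F)
        tok = self.value_proj(x.unsqueeze(-1)) + self.col_embed.weight
        tok = torch.where(masked.unsqueeze(-1), self.mask_token.expand_as(tok), tok)
        return self.head(self.encoder(tok)).squeeze(-1)      # (B, F) reconstructed values

# training loop: hide some observed cells and score the reconstruction on them
model = MaskedTabularImputer(n_features=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 8)                                       # toy batch
missing = torch.rand_like(x) < 0.15                          # truly missing cells
for step in range(200):
    mlm_mask = missing | (torch.rand_like(x) < 0.15)         # extra cells held out for training
    pred = model(x.masked_fill(mlm_mask, 0.0), mlm_mask)
    held_out = mlm_mask & ~missing                            # loss only where truth is known
    loss = nn.functional.mse_loss(pred[held_out], x[held_out])
    opt.zero_grad()
    loss.backward()
    opt.step()

# at inference, the model's outputs at the truly missing positions are the imputed values
```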