I need to know the most suitable approach for handling missing values in data mining. What are the possible ways to handle this situation while preserving data accuracy, including the removal of data anomalies?
There are two main strategies: ignoring any records that contain missing values, or finding replacements for the missing values. Which one to apply depends on the data set and on what you want to analyse. In brief, if you cannot afford to ignore data items even though they contain missing values, you need to find replacements.
You can replace missing values with a substitute such as the mean, min, max, the most frequent value in the column, or another statistic. The right choice depends on the features of your data. Alternatively, as Daqing Chen says, you can ignore the instance or record that contains the missing value, but this decision can be costly for small datasets.
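As a minimal sketch of the substitution idea above, assuming a single numeric column held in a Python list with `None` marking missing entries (the function name `fill_missing` and the `ages` data are made up for illustration):

```python
from statistics import mean, median, mode

def fill_missing(column, strategy="mean"):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    # Map each strategy name to the function that computes the fill value.
    statistic = {"mean": mean, "median": median, "mode": mode,
                 "min": min, "max": max}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in column]

ages = [23, None, 31, 23, None, 40]   # illustrative data only
print(fill_missing(ages, "mean"))     # gaps filled with 29.25
print(fill_missing(ages, "median"))   # gaps filled with 27.0
print(fill_missing(ages, "mode"))     # gaps filled with 23
```

Note that every gap in the column receives the same value, which shrinks the column's variance. That is one reason the choice of statistic should follow the features of the data.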
If the data is unstructured text, it is better to create a taxonomy and use it to find an appropriate value for each missing entry. If the data is structured, you can use binning methods such as bin boundaries and bin means, among many others.
Quite a number of techniques are available to control the issue of missing values such as replacing the missing value with: (a) closest value, (b) mean value and (c) median value. Some algorithms are also used to deal with the problem of missing values such as k-nearest neighbor.
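To illustrate the k-nearest neighbor idea, here is a rough sketch, assuming numeric rows stored as lists with `None` for missing entries. The function name `knn_impute`, the distance definition (root mean squared difference over the features both rows have observed) and the sample data are my own assumptions, not a reference implementation:

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Compare only the features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows where column j is observed.
            donors = [r for n, r in enumerate(rows) if n != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

data = [[1.0, 2.0], [1.1, None], [5.0, 9.0], [0.9, 2.2]]  # illustrative data only
print(knn_impute(data, k=2))  # the gap is filled from rows 0 and 3, giving about 2.1
```

In practice the features should be put on a comparable scale first, otherwise the column with the largest range dominates the distance.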
This is a data enhancement problem. Attempts to enhance data have a bearing on the results and must therefore be conducted with caution. The following is a procedure for managing missing data:
1. Eliminate all instances with inconsistent data and perform an analysis.
2. Choose a data filling method, such as the averaging technique, apply it to the original data, then perform your analysis.
3. Compare the results of Steps 1 and 2. If there is no statistically significant difference between them, report the results. If there is a difference, apply further data enhancement techniques to the original data and repeat the analysis, producing perhaps three sets of results, then report the results conditionally. This procedure helps to minimise errors.
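The three steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the "analysis" is a least-squares slope between two columns, the filling method is column-mean imputation, and the relative `tolerance` is only a placeholder for whatever statistical test suits your study.

```python
from statistics import mean

def listwise_delete(rows):
    """Step 1: keep only the rows without any missing value."""
    return [r for r in rows if None not in r]

def mean_impute(rows):
    """Step 2: fill each gap with the mean of its column."""
    fills = [mean(v for v in col if v is not None) for col in zip(*rows)]
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def slope(rows):
    """Example analysis: least-squares slope of column 1 on column 0."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def compare(rows, analysis=slope, tolerance=0.1):
    """Step 3: run the analysis both ways and flag whether the results agree."""
    deleted = analysis(listwise_delete(rows))
    imputed = analysis(mean_impute(rows))
    # Placeholder check: a real study would use a proper statistical test here.
    agree = abs(deleted - imputed) <= tolerance * max(abs(deleted), abs(imputed), 1e-12)
    return deleted, imputed, agree

rows = [[1, 2], [2, 4], [3, None], [4, 8], [None, 10], [6, 12]]  # illustrative data
print(compare(rows))
```

If `agree` comes back false, that is the signal to try further filling methods and to report the results conditionally.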
The best technique depends on the characteristics and properties of your dataset. E.g. I recently published a paper on "Generic Data Imputation and Feature Extraction for Signals from Multifunctional Printers" http://ceur-ws.org/Vol-2322/dsi4-1.pdf that focuses on IoT data.
Here are some options, some of which have been discussed above:
Option 1: Ignore samples with missing data
Option 2: Ignore variables with missing data
Option 3: Build a model based on samples without missing data and use the model to impute missing data in the samples with missing data
Option 4: Build a model with all samples while estimating the model parameters jointly with the missing data
Most applications of these four options assume that the missing data were removed at random (regardless of the chosen option). There are, however, also cases where missing data are caused by sensor signals that are out of range, measurements that are below detection limits, or values removed by other processes (e.g. remote sensor battery management, deadband/swinging door filter). For such cases, a specialized case-specific model is often required to handle missing data or to obtain optimal estimates of the missing data.
With options 3 and 4, one additionally has the challenge that the chosen model must be trusted. Since calibrating or selecting a model is often the primary objective of the data mining effort in the first place, such a trusted model can be hard to obtain at the time of imputation. This is why options 1 and 2 are still often chosen.
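For Option 3, a minimal sketch could look like the following, assuming the simplest possible model: a straight line fitted to the complete samples and then used to predict a missing second variable. The names `fit_line` and `regression_impute` and the sample data are hypothetical.

```python
from statistics import mean

def fit_line(pairs):
    """Least-squares fit of y = slope * x + intercept on complete (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variation in the complete samples")
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    return slope, my - slope * mx

def regression_impute(pairs):
    """Fill a missing y from the line fitted on samples without missing data."""
    complete = [(x, y) for x, y in pairs if x is not None and y is not None]
    slope, intercept = fit_line(complete)
    return [(x, slope * x + intercept) if y is None and x is not None else (x, y)
            for x, y in pairs]

samples = [(1, 3), (2, 5), (3, None), (4, 9)]  # illustrative data only
print(regression_impute(samples))  # the missing y is predicted as about 7
```

The imputed value is only as good as the fitted line, which is exactly the trust issue described above.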
My comments are about time-series prediction, so they may not apply to other research topics. Typical methods are listwise deletion (deleting any observation that contains at least one missing value), replacement (mean, median, neighbors), and imputation. Many studies use replacement or imputation because they have to fill the blanks in order to proceed with their investigations. If you have such a restriction, replacement (such as a moving window average) or model-based imputation is recommended.
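A moving window average replacement could be sketched like this, assuming an evenly spaced series stored as a list with `None` for the blanks (the function name `moving_average_fill` and the `window` parameter are illustrative choices):

```python
def moving_average_fill(series, window=2):
    """Replace None with the mean of observed values within `window` steps on each side."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        # Use only original observations, never values filled in earlier.
        neighbours = [x for x in series[lo:hi] if x is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled

readings = [10, 12, None, 16, 18]  # illustrative data only
print(moving_average_fill(readings, window=1))  # [10, 12, 14.0, 16, 18]
```

This window is centred, so it uses observations from after the gap. For a forecasting setup you may prefer a trailing window that looks only at past values.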
Alternatively, if you don't want to introduce "new information", you could use a modified listwise deletion method that I developed. Basically, it is "listwise deletion" + "variable selection". See https://doi.org/10.1061/(ASCE)EE.1943-7870.0001097 if you're interested.