I have a relatively small dataset with missing values in a categorical variable (0 or 1). What is the best way to handle this problem without excluding those records from the dataset?
For example, you might want to model the data. In practice, that means filling the gaps yourself, while making sure the inserted values are, in some sense, as close as possible to the observed part of the original (incomplete) dataset.
Missing data is an important topic in mathematical statistics. The Wikipedia page gives a reasonable overview: https://en.wikipedia.org/wiki/Missing_data . Note that it is important to know why the particular values are missing before deciding which technique is best.
When we are confronted with missing data in a dataset, there are two general approaches: remove the affected records from the dataset, or fill the gaps with a placeholder such as “unknown” or “null”.
Still, you need a procedure for treating such a placeholder, especially if the dataset is, say, a table of gene expression levels. Otherwise, filling the gaps this way is just another (and not even particularly useful) way of encoding the gaps in the data.
We don't recommend removing records with missing values. Instead, you can take the classical approach: replace missing values in nominal attributes with the mode and in numeric attributes with the mean, both computed from the training data (here, the data that contains the missing values).
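As a rough illustration of the mode/mean replacement described above, here is a minimal sketch using only the Python standard library. The function name `impute_column`, the `kind` argument, and the example columns `smoker` and `age` are hypothetical and chosen just for this example; missing entries are assumed to be represented as `None`.

```python
from statistics import mean, mode


def impute_column(values, kind):
    """Replace None entries with the mode (nominal) or mean (numeric)
    of the observed values in the same column."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = mode(observed) if kind == "nominal" else mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical example data: a binary categorical column and a numeric column.
smoker = [1, 0, None, 1, 1, None, 0]
age = [34, None, 51, 29, None, 46, 40]

smoker_filled = impute_column(smoker, "nominal")  # gaps filled with the mode
age_filled = impute_column(age, "numeric")        # gaps filled with the mean
```

Note that this fills every gap in a column with the same value, which shrinks the variance of that column; with a small dataset and many gaps that can matter.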
Another option is to use a more advanced method such as a simple nearest neighbor algorithm. This is an *unsupervised* mechanism that can wrap several machine learning algorithms. Here I introduce one of them, the "simple nearest neighbor", because it showed the best results during the testing stage. This approach is completely different from supervised learning (i.e., classification) with a nearest neighbor classifier. The suggested approach uses the specified nearest neighbor search to determine a neighborhood, and replaces the missing values of the instance currently being processed with:
- the most common label in the neighborhood (nominal attributes).
- the average in the neighborhood (numeric/date attributes).
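The neighborhood-based replacement can be sketched roughly as follows. This is not the implementation of any particular tool, only a minimal standard-library illustration under some assumptions: the function name `knn_impute` and the example columns (age, BMI, smoker) are hypothetical, only one column has gaps (marked as `None`), the remaining columns are numeric and complete, and plain Euclidean distance is used without any feature scaling.

```python
import math
from collections import Counter


def knn_impute(rows, target, kind, k=3):
    """Fill None in column `target` from the k nearest rows that have
    that column observed. Distance is Euclidean over the other columns."""
    features = [i for i in range(len(rows[0])) if i != target]
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("no rows with an observed target value")

    filled = []
    for row in rows:
        new_row = list(row)
        if row[target] is None:
            point = [row[i] for i in features]
            # The k donors closest to the incomplete row form its neighborhood.
            nearest = sorted(
                donors,
                key=lambda d: math.dist(point, [d[i] for i in features]),
            )[:k]
            neighbour_values = [d[target] for d in nearest]
            if kind == "nominal":
                # Most common label in the neighborhood.
                new_row[target] = Counter(neighbour_values).most_common(1)[0][0]
            else:
                # Average in the neighborhood.
                new_row[target] = sum(neighbour_values) / len(neighbour_values)
        filled.append(new_row)
    return filled


# Hypothetical data: columns are (age, bmi, smoker); smoker is 0/1 with gaps.
data = [
    [25, 22.0, 0],
    [27, 23.5, 0],
    [30, 24.0, 0],
    [55, 31.0, 1],
    [58, 29.5, 1],
    [60, 30.0, 1],
    [28, 23.0, None],
    [57, 30.5, None],
]

completed = knn_impute(data, target=2, kind="nominal", k=3)
```

In real data the feature columns usually have different scales, so they would normally be standardized before computing distances; otherwise the column with the largest range dominates the choice of neighbors.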