Values of some parameters are missing (blank) for some objects. How can I use (or transform) the SVM classification method for such data sets? Please give me advice, links, references, articles, etc. Thanks beforehand. Regards, Sergey.
Before applying any machine learning algorithm to your dataset, it is fundamental to understand and preprocess your data. If your data has missing values, you can apply several techniques described in Han, Kamber, & Pei's book Data Mining: Concepts and Techniques. It would be helpful to read the entire book, but if you are in a hurry, the answers you're looking for can be found on pp. 88-111.
Usually by "missing data" we mean that some objects have no labels. In my task the situation is different: labels exist for all objects in the training data set, but for some objects the values of some parameters are missing.
I'm afraid that applying semi-supervised learning is not useful (at least at this stage), as it requires building a prediction model from full data with NO missing values (while here the data do have missing values) and then using the resulting model to predict the unlabeled data (a later stage, not the current situation). In fact, Sergey needs to preprocess the data before building a prediction model, so he needs to handle the missing values at the very early stage.
I agree with your idea that semi-supervised learning is not a proper tool in the preprocessing steps. However, if some of the data are labeled and Sergey does not throw away the unlabeled data entirely, semi-supervised learning might still help.
Dear all, thanks for your answers. Certainly, missing labels are not my problem - all objects have labels. The problem is rather the missing parameter values. I will carefully study your proposals.
Dear Samer Sarsam, once more thanks for your answer ("If the missing value is a numeric one, replace it with the mean of the associated attribute. Otherwise, replace it with the mode (if it is nominal)"). In my opinion, it isn't fully correct. Some parameters may be strongly correlated, and in that case, to fill in the missing value of one parameter we should take into account the values of the other (non-missing!) parameters and their correlations. Perhaps you know some articles that consider this approach? Thanks beforehand. Regards, Sergey.
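The correlation-aware idea above can be sketched as regression imputation: fit a model of the incomplete parameter on a correlated, fully observed one using the complete cases, then predict the gaps. A minimal pure-Python sketch with hypothetical data (x1, x2 and the values are illustrative, not from the thread):

```python
# Sketch: regression-based imputation using a correlated parameter.
# Hypothetical data: x2 is correlated with x1; None marks a missing x2 value.
# Fit a least-squares line on the complete cases, then predict the gaps.

def regression_impute(x1, x2):
    """Fill None entries of x2 from a linear fit x2 ~ x1 (complete cases only)."""
    pairs = [(a, b) for a, b in zip(x1, x2) if b is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var = sum((a - mean_a) ** 2 for a, _ in pairs)
    slope = cov / var
    intercept = mean_b - slope * mean_a
    return [b if b is not None else intercept + slope * a
            for a, b in zip(x1, x2)]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 4.0, None, 8.1, None]   # roughly x2 = 2 * x1
filled = regression_impute(x1, x2)  # gaps filled near 6 and 10
```

With several correlated predictors this generalises to multivariate regression or iterative/chained imputation, which the articles Sergey asks about typically cover.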
There is no universally perfect approach for recovering missing information, whatever form it takes; even the prediction model (with its sometimes high complexity) that you would need to build has a percentage error.
Regarding handling missing values, knowing your data is a crucial factor that you need to consider at the very early stage. The strategy I suggested is an old and common one, discussed in several resources in the literature. For example, kindly consult the book "Data Mining and Predictive Analytics" (second edition) by Larose (2015).
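The mean/mode strategy described above can be sketched in a few lines; the columns and values here are hypothetical, with None marking a missing entry:

```python
# Sketch of the mean/mode strategy: numeric attributes get the column mean,
# nominal attributes get the most frequent value (mode).
from statistics import mean, mode

def impute_column(values, numeric):
    present = [v for v in values if v is not None]
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]

ages   = [25, None, 31, 28]            # numeric attribute
colors = ["red", "blue", None, "red"]  # nominal attribute
print(impute_column(ages, numeric=True))    # missing age -> 28.0 (mean of 25, 31, 28)
print(impute_column(colors, numeric=False)) # missing color -> "red" (mode)
```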
On the other hand, generally, if you have a group of subjects, the dataset of each is a combination of {(instances/examples) and (attributes/features)}. In order to build a classifier model from all subjects' datasets, those datasets have to be compatible (same instance/attribute characteristics). Nevertheless, if some subjects have missing *instances*, you can use the strategy I suggested. But if a subject's dataset has missing *attributes*, then you can:
- either remove all these attributes from all subjects, or
- build a predictive model from all the subjects that have full attributes and use this model to predict the missing attribute(s) in each participant's dataset.
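The second option can be sketched as follows; a 1-nearest-neighbour predictor stands in for whatever model you would actually train, and the rows and indices are hypothetical:

```python
# Sketch: learn a missing attribute from subjects that have it, then predict
# it for a subject that lacks it. Here a simple 1-NN over the attributes the
# partial row does have plays the role of the predictive model.

def predict_missing_attribute(complete_rows, target_idx, partial_row):
    """Predict partial_row's missing attribute at target_idx via 1-NN
    over the attributes that partial_row does have."""
    def dist(row):
        return sum((row[i] - partial_row[i]) ** 2
                   for i in range(len(row)) if i != target_idx)
    nearest = min(complete_rows, key=dist)
    return nearest[target_idx]

complete = [[1.0, 2.0, 10.0],
            [1.1, 2.1, 11.0],
            [5.0, 6.0, 50.0]]
row = [1.02, 2.02, None]                 # attribute 2 is missing
row[2] = predict_missing_attribute(complete, 2, row)
```

Any regressor or classifier trained on the complete subjects would slot in the same way; 1-NN just keeps the sketch self-contained.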
I also see the following approach: step by step, prune (delete) the parameters that can have missing values, and solve the SVM task for the limited set of input parameters. The drawback is that we would have to solve several SVM tasks, e.g. for the full set of variables, without parameter 3, without parameter 7, etc.
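The pruning idea can be sketched as building one reduced dataset per potentially-missing parameter, each of which would then get its own SVM; the rows and column indices here are hypothetical, and the SVM training itself is left as a placeholder:

```python
# Sketch: for each attribute that can be missing, build a reduced dataset
# without it. Each reduced dataset would be passed to a separate SVM training
# run (not shown); at prediction time you pick the variant whose attribute
# set matches the values actually available.

def prune_column(rows, col):
    """Return rows with column `col` removed."""
    return [[v for i, v in enumerate(row) if i != col] for row in rows]

rows = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
missing_prone = [1, 2]                  # attributes that can be missing
variants = {c: prune_column(rows, c) for c in missing_prone}
```

This mirrors the drawback noted above: the number of SVM tasks grows with the number of parameter subsets you need to cover.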
https://research.fhcrc.org/content/dam/stripe/wu/files/Publications/2018wires-svm.pdf is an interesting discussion of missing data and its consequences and handling in SVM. One of their suggestions (based on http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.535.2616&rep=rep1&type=pdf): find the k nearest neighbours of the vector with missing data, excluding the offending parameter, then calculate the mean of that parameter over these k neighbours and impute it. This should work particularly well for MAR (missing at random) data.
Imputation of means or medians is appropriate for MCAR (missing completely at random) data, but the KNN approach may work here too.
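The k-NN suggestion above can be sketched in pure Python; the rows, the choice of k, and the distance over the remaining dimensions are illustrative:

```python
# Sketch of k-NN imputation: find the k nearest neighbours of the incomplete
# row, measuring distance only over the non-missing dimensions, then impute
# the mean of the neighbours' values for the missing parameter.

def knn_impute(rows, row_idx, col, k=2):
    target = rows[row_idx]
    def dist(row):
        return sum((row[i] - target[i]) ** 2
                   for i in range(len(row)) if i != col)
    candidates = [r for j, r in enumerate(rows)
                  if j != row_idx and r[col] is not None]
    neighbours = sorted(candidates, key=dist)[:k]
    return sum(r[col] for r in neighbours) / len(neighbours)

rows = [[1.0, 1.0, 5.0],
        [1.1, 0.9, 7.0],
        [9.0, 9.0, 100.0],
        [1.05, 0.95, None]]
rows[3][2] = knn_impute(rows, 3, 2, k=2)   # mean of the 2 nearest: (5 + 7) / 2
```

In practice features should be scaled before computing distances, since the nearest-neighbour search is sensitive to units.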
MNAR data are of course particularly worrisome. For example, in a study on medical student performance, 1/3 of the cases had been dismissed for poor grades during the course, so that final grades (USMLE Step 2) were unavailable. Using only the complete cases would have biased the data, and imputing means was obviously inappropriate too: these students had been dismissed precisely because their performance was well below the mean. I calculated each student's deviation from the mean (in standard deviations) on their last available grade and used that to estimate their likely performance in later exams. That seems to have worked reasonably well.
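A minimal sketch of that approach (all numbers hypothetical, not from the study): express the dismissed student's last available grade as a z-score against the cohort, then map that z-score onto the later exam's distribution.

```python
# Sketch: carry a student's relative standing (z-score) forward from the last
# exam they took to a later exam they missed. Cohort grades are hypothetical.
from statistics import mean, stdev

def estimate_later_grade(last_grades, later_grades, student_last):
    """Map the student's z-score on the last exam onto the later exam."""
    z = (student_last - mean(last_grades)) / stdev(last_grades)
    return mean(later_grades) + z * stdev(later_grades)

cohort_last  = [70, 75, 80, 85, 90]   # last exam everyone took
cohort_later = [72, 76, 80, 84, 88]   # later exam (complete cases only)
estimate = estimate_later_grade(cohort_last, cohort_later, student_last=60)
# The student sits well below the cohort mean on the last exam, so the
# estimate lands correspondingly below the later exam's mean.
```

This preserves the below-mean standing that plain mean imputation would erase, which is exactly why it suits this MNAR situation.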