The values of some parameters are missing (blank) for some objects. How can the SVM classification method be used (or adapted) for such data sets? Please give me advice, links, references, articles, etc. Thanks in advance. Regards, Sergey.
By "missing data" people usually mean that some objects have no labels. My situation is different: labels exist for all objects in the training data set, but for some objects the values of some parameters do not exist.
I'm afraid that applying semi-supervised learning is not useful (at least at this stage), as it requires building a prediction model from full data that have NO missing values (while here the data do have missing values) and then using the resulting model to predict the unlabeled data (a later stage, not the current situation). In fact, Sergey needs to preprocess the data before building a prediction model, so he needs to handle the missing values at the very early stage.
I agree with your idea that semi-supervised learning is not a proper tool for the preprocessing steps. However, if only some of the data have labels and Sergey does not throw away the unlabeled data entirely, maybe semi-supervised learning can help.
Dear all, thanks for your answers. Certainly, missing labels are not a problem for me: all objects have labels. The problem is rather the missing parameter values. I will carefully study your proposals.
Dear Samer Sarsam, thanks once more for your answer ("If the missing value is a numeric one, replace it with the mean of the associated attribute. Otherwise, replace it with the mode (if it is nominal)"). In my opinion, it isn't fully correct. Some parameters may be strongly correlated, and in this case, to fill in a missing value of some parameter we should take into account the values of the other (not missing!) parameters and the correlations between them. Perhaps you know some articles that consider this approach? Thanks in advance. Regards, Sergey.
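To illustrate the correlation point, here is a minimal sketch that compares mean imputation with a regression-based imputer that fills each missing value from the other, observed parameters. It assumes scikit-learn is available and uses its `IterativeImputer`, which scikit-learn still marks as experimental. The two-column synthetic data set and all variable names are assumptions made up for this example, not data from the thread.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic data (assumption): parameter x1 is strongly correlated with x0.
x0 = rng.normal(size=200)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=200)
X_true = np.column_stack([x0, x1])

# Blank out roughly 20% of the x1 values.
X_missing = X_true.copy()
mask = rng.random(200) < 0.2
X_missing[mask, 1] = np.nan

# Mean imputation ignores x0; iterative imputation regresses x1 on x0.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_iter = IterativeImputer(random_state=0).fit_transform(X_missing)


def rmse(X_filled):
    """Root-mean-square error on the entries that were blanked out."""
    diff = X_filled[mask, 1] - X_true[mask, 1]
    return float(np.sqrt(np.mean(diff ** 2)))


rmse_mean = rmse(X_mean)
rmse_iter = rmse(X_iter)
print(f"mean imputation RMSE:      {rmse_mean:.3f}")
print(f"iterative imputation RMSE: {rmse_iter:.3f}")
```

On data like this, where one parameter is almost a linear function of another, the regression-based fill should be much closer to the true values than the column mean. With weakly correlated parameters the gap would shrink.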
There is no universal, perfect approach for recovering hidden values (whatever they are); even the prediction model that you need to build (sometimes a highly complex one) has some percentage of error.
With regard to handling missing values, knowing your data is crucial, and you need to consider it at the very early stage. The strategy I suggested is an old and common one, and several resources in the literature discuss it. For example, kindly consult the book "Data Mining and Predictive Analytics", second edition, by Larose (2015).
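As a rough sketch of the mean/mode strategy applied before an SVM, the pipeline below imputes a numeric column with its mean and a nominal column with its mode, then trains a classifier. It assumes scikit-learn. The tiny data set, the integer coding of the nominal attribute (0 and 1), and the choice of an RBF kernel are all assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy data (assumption): column 0 is numeric, column 1 is a nominal
# attribute stored as integer codes; np.nan marks a blank value.
X = np.array([
    [1.0, 0.0],
    [1.2, np.nan],
    [np.nan, 0.0],
    [3.1, 1.0],
    [2.9, 1.0],
    [np.nan, 1.0],
    [1.1, 0.0],
    [3.0, np.nan],
])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])  # every object has a label

# Numeric attribute: fill with the mean, then scale (SVMs are scale-sensitive).
numeric = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
# Nominal attribute: fill with the mode, then one-hot encode.
nominal = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = ColumnTransformer([("num", numeric, [0]), ("nom", nominal, [1])])

model = make_pipeline(preprocess, SVC(kernel="rbf"))
model.fit(X, y)

# New objects with blanks go through the same imputation before prediction.
X_new = np.array([[np.nan, 1.0], [1.05, np.nan]])
pred = model.predict(X_new)
print(pred)
```

Keeping the imputers inside the pipeline means the means and modes are learned from the training data only and then reused for new objects, which avoids leaking information during cross-validation.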
On the other hand, in general, if you have a group of subjects, each subject's dataset is a combination of instances (examples) and attributes (features). In order to build a classifier model from all subjects' datasets, these datasets have to be compatible (the same instance/attribute characteristics). Nevertheless, if some of them have missing *instances*, you can use the strategy I suggested. But if a subject's dataset has missing *attributes*, then you can:
- either remove all these attributes from all subjects, or
- build a predictive model from all the subjects that have full attributes and use this model to predict the missing attribute(s) in each participant's dataset.
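Both options can be sketched in a few lines. The example assumes scikit-learn and NumPy, treats each row as one object, and uses a plain linear regression as the predictive model. The regression model, the synthetic data, and the variable names are assumptions chosen for illustration; any regressor or classifier could play the same role.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data (assumption): attribute 2 depends on attributes 0 and 1.
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.05, size=100)
X_true = X.copy()

# Blank out attribute 2 for roughly a quarter of the objects.
missing = rng.random(100) < 0.25
X[missing, 2] = np.nan

# Option 1: remove every attribute that has at least one missing value.
keep = ~np.isnan(X).any(axis=0)
X_dropped = X[:, keep]

# Option 2: fit a model on the complete objects and use it to predict
# the missing attribute for the incomplete ones.
complete = ~missing
reg = LinearRegression().fit(X[complete][:, :2], X[complete, 2])
X_filled = X.copy()
X_filled[missing, 2] = reg.predict(X[missing][:, :2])

max_error = float(np.abs(X_filled[missing, 2] - X_true[missing, 2]).max())
print("columns kept by option 1:", np.flatnonzero(keep))
print("largest fill error with option 2:", round(max_error, 3))
```

Option 1 is simpler but discards whatever information the dropped attribute carried. Option 2 keeps the attribute, at the cost of depending on how well the other attributes predict it.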
I also see the following approach: prune (delete), step by step, the parameters that can have missing values, and solve the SVM task for the reduced set of input parameters. The drawback is that we have to solve several SVM tasks, e.g. for the full set of variables, without parameter 3, without parameter 7, etc.
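A minimal sketch of this reduced-parameter idea, assuming scikit-learn: one SVM is trained per pattern of observed parameters, using only the objects in which all of those parameters are present. The helper names `fit_reduced_models` and `predict_reduced`, the synthetic data, and the default `SVC` settings are hypothetical choices for this example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 300
y = rng.integers(0, 2, size=n)
# Synthetic data (assumption): class 1 is shifted by 2 in every parameter.
X = rng.normal(size=(n, 4)) + 2.0 * y[:, None]
# Blank out parameter 1 in some objects and parameter 3 in others.
X[rng.random(n) < 0.15, 1] = np.nan
X[rng.random(n) < 0.15, 3] = np.nan


def fit_reduced_models(X, y):
    """Train one SVM per distinct pattern of observed parameters."""
    observed = ~np.isnan(X)
    models = {}
    for pattern in np.unique(observed, axis=0):
        cols = np.flatnonzero(pattern)
        # Use every object that has at least these parameters observed.
        rows = observed[:, cols].all(axis=1)
        clf = make_pipeline(StandardScaler(), SVC())
        clf.fit(X[rows][:, cols], y[rows])
        models[tuple(pattern)] = (cols, clf)
    return models


def predict_reduced(models, X):
    """Route each object to the SVM that matches its observed parameters."""
    observed = ~np.isnan(X)
    pred = np.full(len(X), -1)  # -1 flags a pattern with no trained model
    for pattern, (cols, clf) in models.items():
        rows = (observed == np.array(pattern)).all(axis=1)
        if rows.any():
            pred[rows] = clf.predict(X[rows][:, cols])
    return pred


models = fit_reduced_models(X, y)
pred = predict_reduced(models, X)
accuracy = float((pred == y).mean())  # training accuracy, for illustration only
print(len(models), "models trained; training accuracy:", round(accuracy, 3))
```

The number of models grows with the number of distinct missing-value patterns, which is the drawback mentioned above. With many parameters that can be blank, it may be more practical to combine this with imputation for the rarest patterns.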