Dear all respectful researchers,
I am working on a structured biomedical dataset that consists of many data type inconsistency, outliers and missing values (instances) on seven independent variables (attributes). I am considering to perform pre-processing methods such as data standardization and also imputations to improve the issues mentioned above. However, there are two version of the pre-processing methods, that is, supervised and unsupervised ones.
My main two questions regarding the common practice are:
1. Should I perform unsupervised discretisation method on the dataset to solve data type issue when, subsequently, I conduct cluster analysis using k-means cluster algorithm?
2. After completing the first clustering task above, should I perform supervised discretisation method on the same dataset when I train the model for classification task using supervised machine learning algorithms?
Thank you for your information and experience.