I am looking at a relatively large data set with both numerical and nominal features. Unfortunately, as is often the case with such data, some of the entries are either missing or invalid. Currently, I am following this approach:

i. For numerical continuous features: find the mean/median for each class and replace the invalid/missing value with the mean/median of that class.

ii. For categorical/nominal features: find the mode and replace the invalid/missing value with it.
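
To make the two steps above concrete, here is a rough sketch in Python with pandas. It assumes that invalid entries have already been converted to NaN, that the mode is also taken per class (as for the mean/median), and that the class label is known for every row being filled. The function names, column names, and toy data are made up for illustration only.

```python
import numpy as np
import pandas as pd


def fit_class_fills(train, label_col, num_cols, cat_cols):
    """Compute one fill value per (class, feature) from the training data."""
    grouped = train.groupby(label_col)
    # Median of each numerical column within each class (NaN is skipped).
    fills = grouped[num_cols].median()
    for col in cat_cols:
        # Most frequent category within each class; NaN if the class has none.
        fills[col] = grouped[col].agg(
            lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
        )
    return fills


def apply_class_fills(df, label_col, fills):
    """Replace NaN entries with the fill value of the row's class."""
    out = df.copy()
    for col in fills.columns:
        # Look up the fill value for each row's class, then fill only the gaps.
        per_row_fill = out[label_col].map(fills[col])
        out[col] = out[col].fillna(per_row_fill)
    return out


# Hypothetical toy data: two classes, one numerical and one nominal feature.
train = pd.DataFrame({
    "label": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 3.0, np.nan, 10.0, np.nan, 20.0],
    "color": ["red", "red", None, "blue", None, "blue"],
})

fills = fit_class_fills(train, "label", ["x"], ["color"])
filled = apply_class_fills(train, "label", fills)
print(filled)
```

Keeping the fill values in a separate table means the same values derived from the training data can be reused later on any other labelled data.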

I am not entirely sure what to do for test data, but I am thinking of following the approach suggested here (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1).

Could someone comment on my approach and suggest a better one?

Thanks!
