I am working with a relatively large data set that has both numerical and nominal features. Unfortunately, as is often the case with such data, some of the entries are either missing or invalid. Currently, I am following this approach:
i. For numerical (continuous) features: find the mean/median for each class and replace the invalid/missing values with the mean/median of that class.
ii. For categorical/nominal features: find the mode and replace the invalid/missing values with it.
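To make the two steps above concrete, here is a minimal sketch in Python with pandas. It assumes that invalid entries have already been converted to NaN, that the median is used for numerical columns, and that the mode is also computed per class (the list above only says so explicitly for the numerical case). The function names, column names, and toy data are all made up for illustration.

```python
import numpy as np
import pandas as pd


def _mode_or_nan(s):
    # Most frequent non-missing value; on ties pandas returns the modes
    # sorted, so the first one is picked. NaN if the class has no values.
    m = s.mode(dropna=True)
    return m.iloc[0] if not m.empty else np.nan


def fit_class_fills(train, label_col, num_cols, cat_cols):
    """Compute per-class fill values from the training data only."""
    grouped = train.groupby(label_col)
    num_fills = grouped[num_cols].median()  # NaN is skipped by default
    cat_fills = pd.DataFrame({c: grouped[c].agg(_mode_or_nan) for c in cat_cols})
    return num_fills, cat_fills


def apply_class_fills(df, label_col, num_fills, cat_fills):
    """Replace NaN in each column with the fill value of the row's class."""
    out = df.copy()
    for fills in (num_fills, cat_fills):
        for col in fills.columns:
            per_row = out[label_col].map(fills[col])  # class -> fill value
            out[col] = out[col].fillna(per_row)
    return out


# Hypothetical toy data: two classes, one numerical and one nominal feature.
train = pd.DataFrame({
    "label": ["a", "a", "a", "b", "b", "b"],
    "height": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    "colour": ["red", "red", None, "blue", None, "blue"],
})

num_fills, cat_fills = fit_class_fills(train, "label", ["height"], ["colour"])
filled = apply_class_fills(train, "label", num_fills, cat_fills)
print(filled)
```

Keeping the "fit" and "apply" steps separate means the fill values come from the training data alone, so the same values can be reused on any other labelled data without recomputing them.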
I am not entirely sure what to do for the test data, where the class is not known, but I am thinking of following the approach suggested here (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1).
Could someone comment on my approach and suggest a better one?
Thanks!