Hi there,

I have over 1 million records, with a categorical response variable and a mix of categorical and continuous predictors. There are several missing values in the dataset, which are represented by a zero. This is just a trial; my final dataset will have at least 10 more predictors. I have attached a copy of what the data looks like...

I'm using Python to run a random forest model to determine which variables best predict y. I have two questions:

1. When I try to train the model on the training set and then predict (y_pred = clf.predict(X_test)), I usually get an error stating that it cannot convert a string to a float. I tried using LabelEncoder to transform the values, but that gives a bad input shape error, which I have not been able to fix. So, is a random forest appropriate for a large dataset with several predictors, both continuous and categorical? If so, any advice on how to structure the data for RF?
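To make question 1 concrete, here is a minimal sketch of the kind of setup involved, assuming scikit-learn and pandas. The column names (habitat, elevation, y) and the toy values are hypothetical stand-ins, not the real data. It one-hot encodes the categorical predictor with OneHotEncoder inside a ColumnTransformer, since LabelEncoder expects a single 1-D column (normally the response), which may be what triggers the bad input shape error when it is given several predictor columns at once.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the real table.
df = pd.DataFrame({
    "habitat": ["forest", "urban", "forest", "wetland", "urban", "wetland"] * 5,
    "elevation": [120.0, 15.5, 98.2, 3.1, 22.0, 5.4] * 5,
    "y": ["present", "absent", "present", "absent", "absent", "present"] * 5,
})

categorical_cols = ["habitat"]   # string-valued predictors
numeric_cols = ["elevation"]     # continuous predictors

# One-hot encode the categorical columns; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X = df[categorical_cols + numeric_cols]
y = df["y"]  # string class labels can be passed to the classifier as they are

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Impurity-based importances, one per encoded column (3 habitat levels + elevation).
importances = model.named_steps["rf"].feature_importances_
```

Note that the importances are reported per one-hot column rather than per original variable, so they would need to be summed per predictor to rank the original variables.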

2. In terms of missing values/NAs, which are represented by a zero, I'm concerned about what RF does with these. Since zero is very close to real values such as 0.5, does the model treat it as a value of zero or as a missing value? Many forums suggest changing these NAs to the median of the column, or to an out-of-range value so that the model might exclude it. I'd prefer excluding NAs rather than adding a value. Any suggestions on a way forward?
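As an illustration of the "exclude rather than fill" option in question 2, here is a small sketch assuming pandas and NumPy. The column names (rainfall, slope) and values are hypothetical. The idea is to recode the zero placeholder as a real missing value first, so that it can no longer be read as a number close to 0.5, and then drop the incomplete rows. It assumes zero is never a legitimate measurement in the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 0 is the placeholder for "missing".
df = pd.DataFrame({
    "rainfall": [0.5, 0.0, 2.3, 1.1, 0.0],
    "slope": [12.0, 7.5, 0.0, 3.2, 9.9],
    "y": ["a", "b", "a", "b", "a"],
})

# Only columns where a zero can never be a real measurement (an assumption).
zero_means_missing = ["rainfall", "slope"]

# Recode the placeholder as NaN so it is no longer treated as the number zero.
df[zero_means_missing] = df[zero_means_missing].replace(0, np.nan)

# Share of missing values per column, useful before deciding to drop rows.
missing_share = df[zero_means_missing].isna().mean()

# Exclude incomplete rows instead of imputing a value.
complete = df.dropna(subset=zero_means_missing)
```

Checking missing_share first shows how many records the exclusion would cost; if a column is mostly missing, dropping that column may lose less data than dropping its rows.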

Thanks in advance!
