
Dear colleagues, I hope all is well.

I have a classification project with an imbalanced outcome: about 25% of subjects experience the event. The data has almost 90 variables, which become roughly 250 after one-hot encoding. I tried an oversampling technique, and accuracy and sensitivity are excellent on the oversampled training data (accuracy up to 95%). On the validation set, however, performance is very poor (accuracy around 25%). This happens with logistic regression, random forest, decision tree, XGBoost, gradient boosting, and bagging.

I wonder whether this could be related to the large number of features (250). Should I run recursive feature elimination with a random forest before fitting all these models on the oversampled data? Would that make a difference? Or is recursive feature elimination with a random forest only used at the end, to obtain a simplified prediction model?
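To make the setup concrete, here is a minimal sketch of the pipeline I have in mind, assuming scikit-learn and imbalanced-learn (SMOTE stands in for my oversampling technique; the data, feature counts, and hyperparameters are placeholders, not my actual study). The point of the pipeline is that RFE and the oversampler are fitted only inside each training fold, so the validation folds are never oversampled:

```python
# Minimal sketch, assuming scikit-learn and imbalanced-learn are installed.
# X, y and all hyperparameters below are hypothetical placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 250 one-hot-style features, ~25% event rate
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 250))
y = (rng.random(1000) < 0.25).astype(int)

pipe = Pipeline(steps=[
    # Recursive feature elimination with a random forest,
    # refitted on each training fold only
    ("rfe", RFE(RandomForestClassifier(n_estimators=200, random_state=0),
                n_features_to_select=30, step=10)),
    # Oversampling is applied only to the training folds,
    # never to the held-out validation data
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=500, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Any of the other classifiers (logistic regression, XGBoost, gradient boosting, bagging) could be swapped in for the final `"clf"` step; the per-fold handling of RFE and oversampling stays the same.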
