Hello colleagues,
I am planning to run a decision tree model to identify profiles that are associated with a binary success variable in a hackathon. Unfortunately, my colleagues who ran the hackathon collected up to 200 predictors, far too many to include in a tree model (where I would guess only 4-5 are useful, correct?).
My hunch would be to first run a logistic LASSO to identify the most important predictors, which then go into the tree model.
I am aware that a variable with no main effect, which the LASSO would therefore not pick up, may still turn out to be relevant in the tree model (for example through an interaction). But as this is an exploratory study and I have to come up with a practical approach, I would nevertheless assume that the LASSO can identify the relevant ones.
A more fine-grained issue is whether there are problems with first running the LASSO on the training data (10-fold CV) and then running the tree on the same training data. In the end, both have to face the test set, where the dues are paid :) I have only just begun to run machine learning models, so I may be missing the obvious :)
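A minimal sketch of this two-step plan, assuming Python with scikit-learn (the post does not name a tool). The data are simulated stand-ins, not the hackathon data: 200 predictors of which 5 have a main effect. The grid values for `C` and `max_depth`, the 75/25 split, and all variable names are illustrative assumptions. The one design choice worth noting is that the L1-penalized logistic regression sits inside the same pipeline as the tree, so the selection is re-fitted within each of the 10 folds instead of once on the full training set.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Simulated stand-in data (assumption): 200 predictors, only the first 5 matter.
rng = np.random.default_rng(0)
n_samples, n_predictors, n_useful = 600, 200, 5
X = rng.normal(size=(n_samples, n_predictors))
logit = X[:, :n_useful].sum(axis=1)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Hold out the test set before any selection or tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: logistic LASSO keeps predictors with a non-zero coefficient.
# Step 2: a shallow tree is grown on the surviving predictors only.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(lasso, threshold=1e-5)),
    ("tree", DecisionTreeClassifier(min_samples_leaf=20, random_state=0)),
])

# 10-fold CV tunes the penalty strength and the tree depth together;
# the selection step is re-run inside every fold.
param_grid = {
    "select__estimator__C": [0.05, 0.1, 0.5],
    "tree__max_depth": [2, 3, 4],
}
search = GridSearchCV(pipe, param_grid, cv=10, scoring="roc_auc")
search.fit(X_train, y_train)

# Which predictors survived the LASSO step in the refitted best pipeline?
selected = search.best_estimator_.named_steps["select"].get_support()
n_selected = int(selected.sum())

# The test set is touched exactly once, at the very end.
test_auc = search.score(X_test, y_test)
print(f"predictors kept: {n_selected}, test AUC: {test_auc:.3f}")
```

Running the LASSO once on the full training data and then cross-validating only the tree on the same rows would tend to make the cross-validated tree score look better than it is, because the selection has already seen every fold. The held-out test score stays honest either way, as long as the test set plays no part in selection or tuning.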
I would appreciate any comments or alternative recommendations.
All the best,
Holger