Hi everyone!
I am trying to find the "best" logistic regression model given an a priori set of predictors (chosen from the literature and from the data available). My binary outcome is whether a firm is associated with a certain production sector, and I would like to know which of my predictors are the most explanatory of this outcome. Ultimately I want to do a diff-in-diff (or something similar) to estimate the impact of such an association on firms' revenues, the challenge there being to build a control group of firms not associated with the sector but still comparable on the most important metrics.
Initially, I have tens of a priori selected categorical predictor variables, as well as numeric ones (I am studying agricultural firms, so I have agrarian surface, livestock units, etc.). The challenges are that:
-my dataset for the logit model is very imbalanced in favor of the group for which the outcome is 0 (i.e. firms that are not associated with the sector): I have about 59,000 such firms vs. 900 firms that are associated...
-within each of my categorical predictor variables, I also have a large imbalance between some levels
-having selected my features with a stepAIC() procedure in R, and having re-grouped levels of the categorical variables to limit the imbalance (for a variable like the sex of the director I can't, so the imbalance remains, even though it may be an important predictor that I would ideally like to keep), I ended up with a model in which most of the remaining predictors failed the "linearity" diagnostic (i.e. the condition of being linearly associated with the logit of the outcome), and log or polynomial transformations do not really fix this association (see the sketch after this list).
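For context, here is roughly what my procedure looks like. This is a minimal sketch: the data frame firms and the variable names (assoc, surface, livestock_units, sex_director, legal_form) are placeholders rather than my real data, and the linearity check is the usual Box-Tidwell-style trick of adding x*log(x) terms rather than a formal test.

```r
library(MASS)  # stepAIC()

# full model with the a priori predictors (names are placeholders)
full <- glm(assoc ~ surface + livestock_units + sex_director + legal_form,
            data = firms, family = binomial)

# stepwise selection on AIC
sel <- stepAIC(full, direction = "both", trace = FALSE)
summary(sel)

# Box-Tidwell-style check of linearity in the logit:
# add x*log(x) terms for the continuous predictors (assumed strictly positive)
# and see whether those terms come out significant
bt <- glm(assoc ~ surface + I(surface * log(surface)) +
            livestock_units + I(livestock_units * log(livestock_units)) +
            sex_director + legal_form,
          data = firms, family = binomial)
summary(bt)
```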
I also tried to undersample the majority group so as to have exactly the same number of firms associated with the sector and not associated with it. Running stepAIC() again, I end up with slightly different models; the accuracy (computed with predict()) is 68%, while the non-sampled model had an artificial accuracy of 98%... with a pseudo R^2 of 0.1864 for the undersampled model and 0.11 for the non-sampled one. The undersampling step looks roughly like the sketch below.
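Again a hedged sketch with the same placeholder names as above; the undersampling is a simple random draw from the majority class, and the pseudo R^2 shown is McFadden's:

```r
set.seed(123)

# undersample the majority class (assoc == 0) down to the size of the minority class
assoc_1  <- subset(firms, assoc == 1)
assoc_0  <- subset(firms, assoc == 0)
balanced <- rbind(assoc_1, assoc_0[sample(nrow(assoc_0), nrow(assoc_1)), ])

full_b <- glm(assoc ~ surface + livestock_units + sex_director + legal_form,
              data = balanced, family = binomial)
sel_b  <- stepAIC(full_b, direction = "both", trace = FALSE)

# in-sample accuracy at a 0.5 cutoff
pred <- ifelse(predict(sel_b, type = "response") > 0.5, 1, 0)
mean(pred == balanced$assoc)

# McFadden's pseudo R^2
1 - sel_b$deviance / sel_b$null.deviance
```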
In both models, I have a fair number of significant predictors, but with very small estimates. Also, while some variables are significant whatever the model used, others change in significance (or in the direction of their estimate) depending on the choices made (undersampling or not, and whether I add a specific interaction before running the AIC selection).
Given my ultimate goal, I am actually not really interested in prediction, but rather in explaining which predictors of my outcome matter most, so I wonder whether this is the right process (especially stepAIC()) for that. If it is, is it enough to conclude that my most important predictors are the variables that remained significant across all the models? Or are there inherent problems in my setup, given the failed linearity diagnostic and the multiple imbalances? In that case, what should I do?
Thanks in advance!