Feature selection for logit Regression with imbalanced dataset and non-linear association between predictors and outcome?

01 September 2023 2 5K Report

Hi everyone!

I am trying to find the "best" logit regression model given an a priori set of predictors (chosen from literature and the data available). My binary outcome is the fact of being associated with a certain production sector, and i would like to know which one of my predictors are the most explanatory of this outcome, knowing that i ultimately want to do a diff-in-diff or so to find the impact of such an association on the revenues of the firms i am looking at, the challenge being of creating a control group of firms not associated with such a sector but still comparable on most important metrics.

Initially, i have a tens of a priori selected categoric predictor variables, as well as numeric ones (i am studying agricultural firms so i have agrarian surface, livestock units etc). The challenges being that :

-my dataset for the logit model is very very imbalanced in favor of the group for which the outcome = 0 (i.e. firms that are not associated with the unit). Actually, i have 59000 of such firms vs 900 firms that are associated...

-inside each of my categorical predictor variables, i also have important imbalance between some levels

-having selected my features based on a stepAIC() procedure in R, and having re-grouped categorical variables so as to limit the imbalance (although for variable like sex of the director, i can't so the imbalance remains, while this may be an important predictor so i ideally want to keep it), i ended up with a model for which remaining predictor variables did not pass, for most, the "linearity" diagnostic (i.e. the condition of being linearly associated with the logit of the outcome), and log or polynomial transformation doesn't really change this association in the right way.

I also tried to undersample the majority group so as to have exactly the same number of firms associated with the unit, and not associated with it. Using the stepAIC() again, i end up with slightly different models, the accuracy (calculated thanks to a predict()) being of 68%, while the non-sampled model had an artificial accuracy of 98%... with a pseudo R^2 of respectively 0.1864 for the sample model and 0.11 for the non-sampled model.

in both models, i have a fair amount of significant predictors, but with very small estimates. Also, while some variables are significant whatever the model used, others change in their significance (or estimate's direction) depending on the choices made (for sampling, as well as for the AIC process according to whether i add a specific interaction).

given my ultimate goal, i am actually not reallly interested in predicting, but rather explaining the most significant predictors of my outcome, so I wonder if this is the right process (especially stepAIC()) to do so? and if so, if this is enough of a process to be able to say that my most significant predictors should be the variables that remained significant the whole time throughout all the models? Or are there inherent problems in my setup, given the non-passed diagnostic and the multiple imbalances? in this case what should i do?

Thanks in advance!

Inès François

Good afternoon,

I am conducting some logistic regression. To handle animbalanced dataset, I did a cross-validation K-fold usually 10 folders, with R you can even repeat 10-fold cross-validations (increase the processing time of the algorithm), then, I created a data partition between the train and test dataset. The outcome of logistic regressions calculating odd ratios can be interpretable but you can conduct a dominance analysis to know which factors explain more the outcome.

Qamar Ul Islam

Camille Perault You're dealing with a complex situation that includes imbalanced data, non-linear associations, and the need for explanatory rather than predictive modeling. Here are some suggestions to address your challenges:

1. Feature Selection: Since you want to explain the most significant predictors, consider alternative feature selection methods tailored for logistic regression with non-linear associations. You can try:

- Lasso Regression: L1 regularization can help select the most important predictors while addressing non-linearity issues.

- Tree-Based Methods: Decision trees and random forests can capture non-linear associations effectively and identify important predictors.

- Mutual Information: This metric can capture dependencies between variables, including non-linear relationships.

2. Addressing Imbalanced Data:

- Oversampling: Instead of undersampling, consider oversampling the minority class using techniques like Synthetic Minority Over-sampling Technique (SMOTE) or Adaptive Synthetic Sampling (ADASYN).

- Weighted Loss: Assign different weights to the classes in your logistic regression model to account for class imbalance.

3. Diagnostic for Non-Linearity: If your remaining predictors do not pass the linearity diagnostic, it suggests a violation of the logistic regression assumption. You can explore:

- Non-linear Transformations: Try non-linear transformations of the predictors (e.g., polynomial terms, splines) to capture non-linear effects.

- Generalized Additive Models (GAMs): GAMs can model non-linear relationships effectively and can be applied in situations where linearity assumptions are violated.

4. Interactions: Interactions can be important in explaining complex relationships. However, be cautious with adding too many interactions, as they can lead to overfitting. Use domain knowledge to guide interaction selection.

5. Cross-Validation: Employ cross-validation techniques, such as k-fold cross-validation, to assess the generalization performance of your models and ensure their robustness.

6 Domain Expertise: Consult with domain experts in agricultural economics or related fields to help identify key predictors and validate the model's findings.

7. Interpretability: Consider using models that offer better interpretability than logistic regression, such as decision trees or rule-based models. This can help explain the relationships between predictors and outcomes more intuitively.

8. Model Evaluation: Evaluate your models based on metrics relevant to your explanatory goals, such as p-values, effect sizes, and the direction of associations.

Remember that explanatory modeling may not always result in high predictive accuracy, but it can provide valuable insights into the relationships between variables. Be cautious about overfitting and prioritize model interpretability for your research objectives.

Text-Communication from the M1 Hand Area using BCI—and then there is Elon Musk?

"A Markov-like Model for Patient Progression"?

How to develop an academic literacy program for engineering at the higher education level?

I need the datasets of Microgrid for system identification?

How combine yolo with Faster R-CNN?

How to Compress Information Neurally?

What exactly is RAG-LLM doing? Isn’t it data engineering?

May I know the exact Quartile of the journal- Advanced Engineering Materials (Wiley) for material science category?

I need the housing or real estate prices data since 1950 till date in India. Can anyone guide how can I get this data ?

Which file formats are accepted for supplementary material?