For inference and predictive models with a binary variable, do you prefer to use Binary Logistic Regressions Models or Gradient Boosting Decision Tree Models, and why?
It depends. Logistic regression will do if you are investigating the influence of predictors on the outcome, and there are no strong non-linearities, no heterogeneity of effects (interactions), and not too many predictor variables. If your goal is prediction, use trees, e.g. boosting.
I choose between Binary Logistic Regression and Gradient Boosting Decision Tree models depending on the specific problem, data characteristics, and the goals of the analysis, as each method has its strengths and weaknesses:
Binary Logistic Regression:
Strengths:
a. Simple and interpretable model: Logistic regression provides a linear relationship between the log-odds of the binary outcome and the input features, making it easy to understand the effect of individual features.
b. Works well when the decision boundary between classes is approximately linear in the features.
c. Provides probabilities for the binary outcome, which can be useful for understanding confidence in predictions or for ranking predictions.
d. Fast training and prediction times.
Weaknesses:
a. Limited to linear relationships between features and the log-odds of the binary outcome. It may not perform well on complex, non-linear relationships between features and the target variable.
b. Sensitive to strong multicollinearity among features, which can make coefficient estimates unstable and hard to interpret; real-world features are often correlated.
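As a minimal sketch of the interpretability point above, here is a logistic regression fit with scikit-learn on synthetic data (make_classification and all settings here are illustrative assumptions, not a recommended configuration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, purely for illustration.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)

# Each coefficient is the change in log-odds of the positive class per
# unit increase in that feature; exponentiating gives an odds ratio.
odds_ratios = np.exp(model.coef_[0])

# Predicted probabilities, usable for ranking or thresholding.
probs = model.predict_proba(X)[:, 1]
```

The odds-ratio reading of the coefficients is what makes the model easy to explain to stakeholders: an odds ratio of 2 means a one-unit increase in that feature doubles the odds of the outcome, holding the others fixed.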
Gradient Boosting Decision Tree:
Strengths:
a. Can model complex, non-linear relationships between features and the target variable.
b. Can handle interactions between features automatically.
c. Typically has higher predictive accuracy than logistic regression for a wide range of problems.
d. Can be tuned with various hyperparameters to achieve the best possible performance.
Weaknesses:
a. Can be more computationally intensive and take longer to train than logistic regression, especially with large datasets and many features.
b. Interpretability can be more challenging, as GBDT models are an ensemble of decision trees, making it harder to understand the contribution of individual features.
c. Prone to overfitting if not properly tuned or regularized.
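The hyperparameter-tuning and overfitting points above can be sketched as follows, again with scikit-learn on synthetic data; the specific hyperparameter values are assumptions for illustration, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with some non-linear structure, for illustration only.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Typical regularization levers: shallow trees (max_depth), a small
# learning_rate paired with more trees (n_estimators), and row
# subsampling (subsample) to reduce overfitting.
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                  max_depth=3, subsample=0.8,
                                  random_state=0)
gbdt.fit(X_tr, y_tr)

# Feature importances offer a coarse, global view of which features the
# ensemble relies on; they sum to 1 but do not give per-feature effects
# the way logistic regression coefficients do.
importances = gbdt.feature_importances_
```

Always evaluate on held-out data (here X_te, y_te); a GBDT can fit the training set almost perfectly while generalizing poorly if left unregularized.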
In general, if the primary goal is interpretability and the relationship between the features and the binary outcome is expected to be relatively simple and linear, binary logistic regression may be the better choice. However, if the primary goal is predictive accuracy, and the relationships between the features and the binary outcome are expected to be more complex and non-linear, GBDT models are usually a better choice.
In practice, it is often beneficial to try both models, and possibly others, and compare their performance on your specific problem and data. Additionally, you may consider L1 (LASSO) or L2 (Ridge) regularization with logistic regression, or feature importance analysis with GBDT models, to improve interpretability and model performance.
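A sketch of that try-both workflow, assuming scikit-learn, synthetic data, and cross-validated AUC as the comparison metric (the C value and cv setting are illustrative, untuned assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)

# The L1 (LASSO) penalty shrinks some coefficients to exactly zero,
# which aids interpretability by pruning weak predictors.
candidates = {
    "logistic_l1": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
    "gbdt": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated ROC AUC for each candidate model.
mean_auc = {
    name: cross_val_score(est, X, y, cv=5, scoring="roc_auc").mean()
    for name, est in candidates.items()
}
```

Comparing cross-validated scores rather than a single train/test split gives a more stable estimate of which model actually generalizes better on your data.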