For inference and predictive models with a binary variable, do you prefer to use Binary Logistic Regressions Models or Gradient Boosting Decision Tree Models, and why?
It depends. Logistic regression will do if you are investigating the influence of predictors on the outcome, and there are no strong non-linearities, no heterogeneity of effects (interactions), and not too many predictor variables. If your goal is prediction, use trees, e.g. boosting.
I choose between Binary Logistic Regression and Gradient Boosting Decision Tree models depending on the specific problem, data characteristics, and the goals of the analysis, as each method has its strengths and weaknesses:
Binary Logistic Regression:
Strengths:
a. Simple and interpretable model: Logistic regression provides a linear relationship between the log-odds of the binary outcome and the input features, making it easy to understand the effect of individual features.
b. Works well when the decision boundary between classes is approximately linear in the features.
c. Provides probabilities for the binary outcome, which can be useful for understanding confidence in predictions or for ranking predictions.
d. Fast training and prediction times.
Weaknesses:
a. Limited to linear relationships between features and the log-odds of the binary outcome. It may not perform well on complex, non-linear relationships between features and the target variable.
b. Sensitive to strong multicollinearity among features, which can make coefficient estimates unstable and hard to interpret; real-world features are often correlated.
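As a minimal sketch of the interpretability point above, here is a logistic regression fit with scikit-learn on synthetic data (make_classification and all settings here are illustrative assumptions, not a recommended configuration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, purely for illustration.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)

# Each coefficient is the change in log-odds of the positive class per
# unit increase in that feature; exponentiating gives an odds ratio.
odds_ratios = np.exp(model.coef_[0])

# Predicted probabilities, usable for ranking or thresholding.
probs = model.predict_proba(X)[:, 1]
```

The odds-ratio reading of the coefficients is what makes the model easy to explain to stakeholders: an odds ratio of 2 means a one-unit increase in that feature doubles the odds of the outcome, holding the others fixed.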
Gradient Boosting Decision Tree:
Strengths:
a. Can model complex, non-linear relationships between features and the target variable.
b. Can handle interactions between features automatically.
c. Typically has higher predictive accuracy than logistic regression for a wide range of problems.
d. Can be tuned with various hyperparameters to achieve the best possible performance.
Weaknesses:
a. Can be more computationally intensive and take longer to train than logistic regression, especially with large datasets and many features.
b. Interpretability can be more challenging, as GBDT models are an ensemble of decision trees, making it harder to understand the contribution of individual features.
c. Prone to overfitting if not properly tuned or regularized.
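The hyperparameter-tuning and overfitting points above can be sketched as follows, again with scikit-learn on synthetic data; the specific hyperparameter values are assumptions for illustration, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with some non-linear structure, for illustration only.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Typical regularization levers: shallow trees (max_depth), a small
# learning_rate paired with more trees (n_estimators), and row
# subsampling (subsample) to reduce overfitting.
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                  max_depth=3, subsample=0.8,
                                  random_state=0)
gbdt.fit(X_tr, y_tr)

# Feature importances offer a coarse, global view of which features the
# ensemble relies on; they sum to 1 but do not give per-feature effects
# the way logistic regression coefficients do.
importances = gbdt.feature_importances_
```

Always evaluate on held-out data (here X_te, y_te); a GBDT can fit the training set almost perfectly while generalizing poorly if left unregularized.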
In general, if the primary goal is interpretability and the relationship between the features and the binary outcome is expected to be relatively simple and linear, binary logistic regression may be the better choice. However, if the primary goal is predictive accuracy, and the relationships between the features and the binary outcome are expected to be more complex and non-linear, GBDT models are usually a better choice.
In practice, it is often beneficial to try both models, and possibly others, and compare their performance on your specific problem and data. Additionally, you may consider L1 (LASSO) or L2 (Ridge) regularization with logistic regression, or feature importance analysis with GBDT models, to improve interpretability and model performance.
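A sketch of that try-both workflow, assuming scikit-learn, synthetic data, and cross-validated AUC as the comparison metric (the C value and cv setting are illustrative, untuned assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)

# The L1 (LASSO) penalty shrinks some coefficients to exactly zero,
# which aids interpretability by pruning weak predictors.
candidates = {
    "logistic_l1": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
    "gbdt": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated ROC AUC for each candidate model.
mean_auc = {
    name: cross_val_score(est, X, y, cv=5, scoring="roc_auc").mean()
    for name, est in candidates.items()
}
```

Comparing cross-validated scores rather than a single train/test split gives a more stable estimate of which model actually generalizes better on your data.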