Hello,

I am fitting a logit model to heart disease data (400k instances) that is imbalanced (90% negative class, 10% positive class). Does anyone know how to proceed in this setting? My approach is the following:

A. Doing logistic regression with the original imbalanced dataset

1. I split the data into training and test sets (80% / 20%).

2. What do I have to do afterwards? Do I then fit an LR classifier on the training data and make predictions on the test set? Does that mean predicting with the model fitted on the training data and comparing those predictions with the true test labels, which gives the confusion matrix (CM)?

3. Based on this I calculate the Recall and Precision metrics

4. As performance measure I chose the Area under the Precision - Recall Curve.

--> This results in an AUC of under 0.5, which seems to be worse than guessing!!
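To make steps 1 to 4 concrete, here is a minimal sketch of what I understand the procedure to be. It assumes scikit-learn and uses a synthetic dataset from `make_classification` as a stand-in for the heart disease data, so the variable names and numbers are illustrative only. It also prints the positive rate of the test set, because for a precision-recall curve the no-skill baseline is roughly that rate (about 0.1 here) rather than 0.5 as for ROC.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=0)

# Step 1: stratified 80/20 split keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 2: fit on the training part only, then predict on the held-out part.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard labels at the default 0.5 cut-off
y_score = model.predict_proba(X_test)[:, 1]  # estimated P(positive)

# Step 3: confusion matrix, precision and recall from the hard labels.
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)

# Step 4: area under the PR curve, computed from the scores, not the labels.
pr_auc = average_precision_score(y_test, y_score)
baseline = y_test.mean()  # a random classifier scores about this on PR AUC

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"PR AUC={pr_auc:.3f} no-skill baseline={baseline:.3f}")
```

`average_precision_score` is one common way to summarise the PR curve; passing the hard 0/1 predictions instead of `y_score` would collapse the curve to a single point.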

B. Applying SMOTE (synthetic oversampling) and random oversampling to correct the imbalance in the dataset

1. I split the data into training and test sets (80% / 20%).

2. Then I apply random oversampling or SMOTE.

3. Then I again fit an LR classifier on the training data and make predictions on the test set? And compute the confusion matrix (CM)?

4. Based on this I calculate the Recall and Precision metrics

5. As performance measure I chose the Area under the Precision - Recall Curve.
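The steps in B could look like the sketch below. It again assumes scikit-learn and a synthetic stand-in dataset, and it shows random oversampling only, done by hand with `sklearn.utils.resample`. The oversampling is applied to the training part after the split, so the test set keeps its original 90/10 ratio. As far as I understand, `SMOTE` from the imbalanced-learn package (`imblearn.over_sampling.SMOTE` with `fit_resample`) would take the place of the resampling lines.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=0)

# Step 1: split first, so no resampled rows can leak into the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 2: random oversampling of the minority class, training data only.
X_maj = X_train[y_train == 0]
X_min = X_train[y_train == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                        np.ones(len(X_min_up), dtype=int)])

# Step 3: fit on the balanced training data, predict on the untouched test set.
model_bal = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
y_score_bal = model_bal.predict_proba(X_test)[:, 1]

# Steps 4 and 5: evaluate against the original, still imbalanced, test labels.
pr_auc_bal = average_precision_score(y_test, y_score_bal)
print(f"train positive rate after oversampling: {y_bal.mean():.2f}")
print(f"test positive rate: {y_test.mean():.2f}  PR AUC: {pr_auc_bal:.3f}")
```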

Further Questions:

- Can I apply threshold tuning in cases A and B, and if so, how? Does it make sense for the balanced dataset in case B? Do I find the best threshold value from the PR curve or from the ROC curve?

- Do I calculate the Precision and Recall metrics after threshold tuning?

- Does a pseudo R2 need to be checked for the coefficients?
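For the threshold questions, this is a sketch of one possible approach, not necessarily the recommended one. It assumes scikit-learn and a synthetic stand-in dataset, holds out a separate validation set (my assumption, so that the test set is not used for tuning), picks the threshold that maximises F1 along the PR curve (F1 is an illustrative choice of criterion), and then recomputes precision and recall on the test set at that threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=0)

# 60% train, 20% validation (for tuning), 20% test (for the final report).
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=0.25, stratify=y_tv, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# PR curve on the validation scores; one candidate threshold per point.
val_score = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_score)

# precision and recall have one more entry than thresholds, so drop the last.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero
best_threshold = thresholds[np.argmax(f1)]

# Precision and recall after tuning, measured on the untouched test set.
test_score = model.predict_proba(X_test)[:, 1]
test_pred = (test_score >= best_threshold).astype(int)
tuned_precision = precision_score(y_test, test_pred, zero_division=0)
tuned_recall = recall_score(y_test, test_pred)

print(f"threshold={best_threshold:.3f} "
      f"precision={tuned_precision:.3f} recall={tuned_recall:.3f}")
```

The same pattern would apply to case B by fitting `model` on the oversampled training data; the validation and test sets would stay at their original class ratio.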

Thank you very much!!!
