Hello,
I am fitting a logit model to heart disease data (400k instances) that is imbalanced (90% negative and 10% positive class). Does anyone know how to proceed in this context? My approach is the following:
A. Doing logistic regression with the original imbalanced dataset
1. I split the data into training and test sets (80%/20%).
2. What do I have to do afterwards? Do I fit an LR classifier on the training data and make predictions on the test set? Does that mean predicting with the model fitted on the training data and comparing the predictions with the actual test labels --> which results in the confusion matrix (CM)?
3. Based on this I calculate the Recall and Precision metrics
4. As the performance measure I chose the area under the precision-recall curve.
--> This results in an AUC of under 0.5, which is worse than guessing!!
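For reference, steps 2 to 4 can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the helper names (`confusion_counts`, `precision_recall`, `average_precision`) and the synthetic labels and scores are assumptions made for the example, and `average_precision` is a simple step-wise estimate of the area under the PR curve that does not group tied scores. It also shows that a classifier scoring at random gets a PR-AUC close to the positive rate (about 0.1 for a 90/10 split), so 0.5 is the chance level for ROC-AUC rather than for the PR curve.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision and recall derived from the confusion matrix counts."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(y_true, scores):
    """Step-wise area under the PR curve: mean precision at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank  # precision at this cut-off
    return total / n_pos if n_pos else 0.0


# Synthetic check (assumed data): 10% positives, scores drawn at random.
rng = random.Random(0)
y_demo = [1] * 2000 + [0] * 18000
random_scores = [rng.random() for _ in y_demo]
# Expected to land near the 10% positive rate, not near 0.5.
print(average_precision(y_demo, random_scores))
```

With real data, `y_true` would be the test labels, `y_pred` the predicted classes, and `scores` the predicted probabilities of the positive class from the fitted model.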
B. Applying SMOTE (synthetic oversampling) or random oversampling to correct the imbalance in the dataset
1. I split the data into training and test sets (80%/20%).
2. Then I apply random oversampling or SMOTE.
3. Then do I again fit an LR classifier on the training data, make predictions on the test set, and build the confusion matrix (CM)?
4. Based on this I calculate the Recall and Precision metrics
5. As the performance measure I chose the area under the precision-recall curve.
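The split-then-oversample order in case B can be sketched as follows. This is only an illustration under stated assumptions: it assumes the resampling is applied to the training split alone, so that the test set keeps the original 90/10 distribution; the helpers `train_test_split` and `random_oversample` and the toy rows are hypothetical; and it uses plain random oversampling. SMOTE (for example `SMOTE` from the imbalanced-learn package) would be swapped in at the same step.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def random_oversample(rows, seed=0):
    """Resample minority rows with replacement until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return rows[:]  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


# Toy data (assumed): each row is (features, label), 10% positives.
rows = [((i,), 1 if i < 100 else 0) for i in range(1000)]

train, test = train_test_split(rows, test_fraction=0.2, seed=42)
train_balanced = random_oversample(train, seed=42)  # only the training split is resampled

n_pos = sum(1 for _, y in train_balanced if y == 1)
n_neg = sum(1 for _, y in train_balanced if y == 0)
print(len(train), len(test), n_pos, n_neg)
```

The model would then be fitted on `train_balanced` and evaluated on the untouched `test` rows, which keeps the reported precision and recall comparable to case A.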
Further Questions:
- Can I apply threshold tuning in cases A and B, and if so, how? Does it make sense on the balanced dataset in case B? Do I choose the best threshold value from the PR curve or from the ROC curve?
- Do I calculate the precision and recall metrics after threshold tuning?
- Do pseudo R2 values need to be checked for the coefficients?
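To make the threshold-tuning question concrete, here is one possible sketch. It assumes the threshold is chosen on a separate validation split (not on the test set) and that F1 is the selection criterion, which is only one of several options; the function names and the toy labels and probabilities are made up for the example. Precision and recall are then recomputed at the chosen threshold.

```python
def metrics_at_threshold(y_true, probs, threshold):
    """Return (precision, recall, f1) when predicting 1 for prob >= threshold."""
    tp = fp = fn = 0
    for t, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, probs):
    """Try every distinct predicted probability as a cut-off and keep the best F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda th: metrics_at_threshold(y_true, probs, th)[2])


# Toy validation data (assumed): labels and predicted positive-class probabilities.
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45, 0.60]

best = best_f1_threshold(y_val, p_val)
print(best, metrics_at_threshold(y_val, p_val, best))
print(0.5, metrics_at_threshold(y_val, p_val, 0.5))  # default cut-off for comparison
```

In this toy example the tuned cut-off is well below 0.5, which is typical when positives are rare and the model's predicted probabilities are correspondingly low.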
Thank you very much!!!