Logit approach with imbalanced data - Precision, Recall, Threshold tuning, SMOTE

01 January 1970 1 6K Report

Hello,

I am applying a logit model to heart disease data (400k instances) that are imbalanced (90% negative and 10% positive class). Does anyone know how to proceed in this context? My approach is the following:

A. Doing logistic regression with the original imbalanced dataset

1. I split data into train and test data (80%,20%)

2. What do I have to do next? Do I fit an LR classifier on the training data and make predictions on the test set? Does that mean predicting with the model fitted on the training data and comparing the predictions with the actual test labels, which gives the confusion matrix (CM)?

3. Based on this I calculate the Recall and Precision metrics

4. As performance measure I chose the Area under the Precision - Recall Curve.

--> This results in an AUC below 0.5, which is worse than guessing!!

B. Applying a SMOTE (synthetic oversampling) AND random oversampling to correct the imbalance in the dataset

1. I split data into train and test data (80%,20%)

2. Then I apply random oversampling or SMOTE

3. Then I again fit an LR classifier on the training data and make predictions on the test set? And compute the confusion matrix (CM).

4. Based on this I calculate the Recall and Precision metrics

5. As performance measure I chose the Area under the Precision - Recall Curve.

Further Questions:

- Can I apply threshold tuning in cases A and B, and if so, how? Does it make sense with a balanced dataset in case B? Do I find the best threshold value from the PR curve or the ROC curve?

- Do I calculate the Precision and Recall metrics after threshold tuning?

- Do Pseudo R2 values need to be checked for the coefficients?

Thank you very much!!!

Erick Kiptoo

When dealing with an imbalanced dataset in the context of applying a logistic regression (Logit) model on heart disease data, there are several steps you can take to address the imbalance and evaluate the performance of your model. Let's go through your approach and address your questions:

Approach A: Logistic regression with the original imbalanced dataset

1. Splitting data: Splitting the data into training and test sets is a good practice to evaluate your model's performance.

2. Fitting the model and making predictions: Fit a logistic regression classifier on the training data and use the trained model to make predictions on the test set.

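3. Confusion matrix and metrics: Calculate the confusion matrix (CM) from the predicted class labels in step 2 (by default, a predicted probability of 0.5 or more counts as positive) and the actual labels in the test set. From the confusion matrix, you can compute metrics such as recall and precision.

4. Performance measure: The Area Under the Precision-Recall Curve (AUC-PR) is a reasonable choice for imbalanced data because it captures the trade-off between precision and recall on the positive class. Compute it from the predicted probabilities, not from the 0/1 predictions. Note that the chance level for AUC-PR is the share of positives in the data (about 0.10 here), not 0.5 as for the ROC AUC, so a value below 0.5 is not by itself worse than guessing.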
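Steps 1 to 4 of approach A can be sketched as follows. This is only a minimal illustration, assuming scikit-learn and a synthetic dataset from make_classification as a stand-in for the heart disease data; the sample size, feature counts, and variable names are assumptions, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Stratify so that train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard 0/1 predictions (default 0.5 cutoff) feed the confusion matrix.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)

# The PR curve summary needs probabilities, not hard labels.
y_score = model.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, y_score)

# A random classifier scores about the positive prevalence, not 0.5.
baseline = y_test.mean()

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"PR AUC={pr_auc:.3f} vs. chance level={baseline:.3f}")
```

Here average_precision_score is used as the summary of the precision-recall curve; comparing it with the positive share in the test set shows whether the model beats chance.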

Approach B: Applying SMOTE or random oversampling to address imbalance

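1. Splitting data: As in approach A, split the data into training and test sets before any resampling.

2. Oversampling: Apply SMOTE or random oversampling to the training set only, and leave the test set at its original class ratio so that the evaluation reflects the real prevalence. SMOTE creates synthetic minority samples by interpolating between neighboring minority points; random oversampling duplicates existing minority samples.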

3. Fitting the model and making predictions: Fit a logistic regression classifier on the balanced training data and make predictions on the test set.

4. Confusion matrix and metrics: Calculate the confusion matrix and evaluate metrics such as recall and precision using the predicted values and the true values in the test set.

5. Performance measure: Continue using the AUC-PR as your performance measure to assess the model's performance.
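A minimal sketch of approach B with random oversampling is given below, again on synthetic data. The helper random_oversample is hypothetical and written only for this illustration; it assumes label 1 is the minority class. If the imbalanced-learn package is available, its SMOTE class (fit_resample on the training data) could presumably be swapped in at the same place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Split first, so that no duplicated row can leak into the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)


def random_oversample(X, y, seed=0):
    """Duplicate minority rows (label 1) until both classes have equal size."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = rng.permutation(np.concatenate([neg, pos, extra]))
    return X[idx], y[idx]


# Resample the training data only; the test set keeps its original ratio.
X_bal, y_bal = random_oversample(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

y_score = model.predict_proba(X_test)[:, 1]
y_pred = (y_score >= 0.5).astype(int)

print(f"train positive share after oversampling: {y_bal.mean():.2f}")
print(f"test positive share (untouched): {y_test.mean():.2f}")
print(f"precision={precision_score(y_test, y_pred, zero_division=0):.3f}")
print(f"recall={recall_score(y_test, y_pred):.3f}")
print(f"PR AUC={average_precision_score(y_test, y_score):.3f}")
```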

Now let's address your additional questions:

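- Threshold tuning: Threshold tuning is applicable in both cases A and B. Logistic regression outputs probabilities, and the default cutoff of 0.5 is only one possible choice. It still makes sense in case B: only the training data are balanced, the test data are not, and training on oversampled data shifts the predicted probabilities for the positive class upward. You can pick the threshold from the precision-recall curve (for example, the point with the highest F1, or the lowest threshold that meets a required precision) or from the ROC curve (for example, the point that maximizes TPR minus FPR); with 10% positives the PR curve is usually the more informative of the two. Choose the threshold on a validation set or by cross-validation, not on the test set.

- Precision and recall metrics: Yes. Calculate precision and recall at the chosen threshold on the held-out test set, because the values at the default 0.5 cutoff no longer describe the tuned model.

- Pseudo R2: Pseudo R2 measures, such as McFadden's R2 or Nagelkerke's R2, summarize the goodness-of-fit of the whole logistic regression model relative to an intercept-only model; they are not computed per coefficient. For example, McFadden's R2 is 1 - lnL(model)/lnL(null). They are not required for assessing classification performance, but they can be helpful in understanding the overall fit of the model.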
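The threshold tuning discussed above can be sketched as follows. This is a minimal example under stated assumptions: synthetic data, a separate validation split carved out of the training data, and F1 as the selection criterion. None of these choices come from the original question, and another criterion (such as a minimum recall) may suit a medical setting better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_recall_curve, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# 60% train, 20% validation (for tuning), 20% test (for the final report).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Precision and recall for every candidate threshold on the validation set.
val_score = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_score)

# precision/recall have one more entry than thresholds; drop the last point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Report precision and recall on the test set at the tuned threshold.
test_score = model.predict_proba(X_test)[:, 1]
y_pred = (test_score >= best_threshold).astype(int)

print(f"tuned threshold={best_threshold:.3f}")
print(f"precision={precision_score(y_test, y_pred, zero_division=0):.3f}")
print(f"recall={recall_score(y_test, y_pred):.3f}")
```

The same pattern applies in case B: fit on the oversampled training data, but keep the validation and test sets at their original class ratio.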

Remember that addressing class imbalance is crucial, and oversampling techniques like SMOTE or random oversampling can help improve the performance of your model on the minority class. Additionally, evaluating multiple metrics and considering the specific requirements of your problem is important for a comprehensive assessment of your model's performance.

If you have further questions or need more specific guidance, please don't hesitate to ask.
