Why do I have good performances for the test set but bad performances in predicting new data ?

14 January 2021

Hello :)

I am trying to predict churn for 2020 in RStudio, based on historical data from 2014 to 2019. The aim is to predict a churn probability for every person.

Initially, 75% of the persons were still in the portfolio and 25% were churners, so I downsampled and now have around 50% of each.

After cleaning, my database contains 405,000 observations and 20 variables. Note that 202,000 persons churned between 2014 and 2019, and 203,000 persons are in the portfolio for 2019 (the portfolio does not vary much from one year to the next).

I split my data into a training set and a test set (70% / 30%). Performance on the test set is good (accuracy and sensitivity > 0.7) for both random forest and logistic regression.

The problem is that when I predict the churn probability of the portfolio for 2020, I obtain poor results: only 20% of the persons the model flagged as churners actually churned in 2020. Did I miss something? I can give more details if this is not clear.
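A rough arithmetic check on the figures above, offered as a sketch rather than a diagnosis: the share of flagged persons who really churn (precision) depends on the true churn rate of the population being scored, not only on the sensitivity and specificity measured on a balanced test set. The post gives sensitivity above 0.7 but neither the specificity nor the 2020 churn rate, so the values 0.7, 0.7 and 0.10 below are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected share of flagged persons who truly churn, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Balanced test set (about 50% churners after downsampling).
balanced = precision_at_prevalence(0.7, 0.7, 0.50)

# Assumed real portfolio in which 10% of persons churn within the year.
realistic = precision_at_prevalence(0.7, 0.7, 0.10)

print(round(balanced, 3), round(realistic, 3))
```

Under these assumed values, a model that looks acceptable on the balanced test set would reach a precision of only about 0.2 on a population with a 10% churn rate, so it may be worth comparing the class balance of the test set with that of the 2020 portfolio.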

Abdelhameed Ibrahim Popular answer

Dear Ameni Barh

If the training step is not performed correctly or based on a bad dataset, the testing will fail.

Raoul G. C. Schönhof

Hello Ameni Barh, is the problem specific to 2020 data or is performance on your test set bad in general? The real danger here is overfitting your model.
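One way to probe this question is to hold out the most recent year instead of a random 30%, so that the test mimics predicting an unseen year. The sketch below uses toy records; the field names `year` and `churned` and the function `out_of_time_split` are assumptions, not part of the original data.

```python
def out_of_time_split(records, holdout_year):
    """Train on years before holdout_year, test on holdout_year only."""
    train = [r for r in records if r["year"] < holdout_year]
    test = [r for r in records if r["year"] == holdout_year]
    return train, test

# Hypothetical toy records: five rows per year from 2014 to 2019.
records = [
    {"year": year, "churned": (i + year) % 4 == 0}
    for year in range(2014, 2020)
    for i in range(5)
]

train, test = out_of_time_split(records, holdout_year=2019)
print(len(train), len(test))
```

If performance on such a held-out year is clearly worse than on a random split, the gap is more likely tied to changes over time than to 2020 alone.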

Cheers, Raoul

Ashwani Kumar

Test data is used while tuning the network, whereas new data is not.

Helge Hecht

You are overfitting your model. Try splitting the data into three sets (training/validation/test): stop training based on the validation error, then evaluate on the test set.

Also use cross-validation with different splits.
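The two suggestions above could be sketched as follows, using only the Python standard library. The 60/20/20 proportions, the seed and the function names are assumptions for illustration; the thread does not prescribe them.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    for fold in range(k):
        val_idx = indices[fold::k]          # every k-th row, offset by fold
        held_out = set(val_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, val_idx

train, val, test = three_way_split(range(100))
folds = list(k_fold_indices(len(train), k=5))
print(len(train), len(val), len(test), len(folds))
```

The test part is set aside once and only touched for the final evaluation; the folds are built from the training part alone.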

Other common causes of overfitting are a training dataset that is too small, a model that is too deep, or training for too long.

I hope this helps, feel free to ask more questions.

Héritier Nsenge

Maybe you are facing an overfitting problem. Try splitting your dataset into three parts (train, validation, test).

Mohammed Saleh

Hi, I think you have two problems. First, you use 20 variables, and that many can confuse the model during training. Try to reduce the number by deleting unimportant variables. Second, I recommend the IRCC technique, which I have used before. You can find it in my publications. Thank you.
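The IRCC technique is not described in this thread, so it is not shown here. As a simple stand-in for spotting unimportant variables, the sketch below ranks variables by the absolute Pearson correlation with the churn label. The column names and values are made up, and a univariate ranking like this can miss variables that matter only in combination with others.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def rank_variables(columns, target):
    """Return variable names sorted from most to least correlated with target."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: 1 = churned, 0 = stayed.
target = [1, 1, 1, 0, 0, 0]
columns = {
    "tenure_years": [1, 2, 3, 4, 5, 6],
    "random_noise": [2, 5, 2, 5, 2, 5],
    "constant_flag": [7, 7, 7, 7, 7, 7],
}

ranked = rank_variables(columns, target)
print(ranked)
```

Variables at the bottom of such a ranking are candidates for removal, ideally confirmed by re-checking performance on held-out data.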
