14 January 2021

Hello :)

I am trying to predict churn for 2020 in RStudio, based on historical data from 2014 to 2019. The aim is to predict a churn probability for every person.

At the beginning, 75% of the persons were still in the portfolio and 25% were churners, so I downsampled the majority class and now have around 50% of each.
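To make the downsampling step concrete, here is a minimal sketch in base R. The data frame `customers`, its columns `id` and `churn`, and the helper `downsample()` are hypothetical stand-ins for illustration, not my actual data or code; the toy data only reproduces the roughly 75/25 split.

```r
# Toy data standing in for the real portfolio (hypothetical columns).
set.seed(42)
n <- 1000
customers <- data.frame(
  id    = seq_len(n),
  churn = factor(ifelse(runif(n) < 0.25, "yes", "no"), levels = c("no", "yes"))
)

# Hypothetical helper: randomly keep as many rows from each class
# as there are rows in the smallest class.
downsample <- function(df, target) {
  groups  <- split(df, df[[target]])
  n_min   <- min(vapply(groups, nrow, integer(1)))
  sampled <- lapply(groups, function(g) g[sample(nrow(g), n_min), , drop = FALSE])
  do.call(rbind, sampled)
}

balanced <- downsample(customers, "churn")
table(customers$churn)  # class counts before downsampling
table(balanced$churn)   # equal class counts after downsampling
```

Note that after this step the share of churners in the training data (about 50%) no longer matches the share in the original population (about 25%).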

After cleaning, my dataset contains 405,000 observations and 20 variables. Note that it consists of 202,000 persons who churned between 2014 and 2019 and 203,000 persons who were in the portfolio in 2019 (the portfolio does not vary much from one year to the next).

I split my data into a training set and a test set (70% / 30%). Performance on the test set is good (accuracy and sensitivity > 0.7) for both random forest and logistic regression.
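The evaluation setup looks roughly like the sketch below. It uses simulated data and only the logistic regression part (base R `glm`); the variables `x1` and `x2`, the simulated coefficients and the 0.5 threshold are assumptions for illustration, not my real features or settings.

```r
# Simulated stand-in data: two hypothetical predictors and a 0/1 churn flag.
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
churn <- rbinom(n, 1, plogis(-0.2 + 1.2 * x1 - 0.8 * x2))
dat <- data.frame(x1, x2, churn)

# 70% / 30% random split into training and test sets.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Logistic regression fitted on the training set only.
fit <- glm(churn ~ x1 + x2, data = train, family = binomial())

# Predicted churn probabilities on the test set, cut at an assumed 0.5 threshold.
p_test <- predict(fit, newdata = test, type = "response")
pred   <- as.integer(p_test >= 0.5)

# Accuracy: share of correct predictions.
accuracy <- mean(pred == test$churn)
# Sensitivity: share of actual churners that were predicted as churners.
sensitivity <- sum(pred == 1 & test$churn == 1) / sum(test$churn == 1)

c(accuracy = accuracy, sensitivity = sensitivity)
```

Both metrics here are computed on a test set drawn from the same balanced sample as the training set.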

The problem is that when I predict the churn probability of the portfolio for 2020, I obtain poor results: only 20% of the persons the model flagged as churners actually churned in 2020. Did I miss something? I can give more details if anything is unclear.
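To be explicit about the 20% figure: it is the share of flagged persons who really churned (often called precision), which is a different metric from the accuracy and sensitivity reported on the test set. A small sketch with made-up numbers; the helper `precision_at()`, the 0.5 threshold and the vectors are hypothetical:

```r
# Hypothetical helper: among persons whose predicted probability reaches
# the threshold, what fraction actually churned?
precision_at <- function(prob, actual, threshold = 0.5) {
  flagged <- prob >= threshold
  if (!any(flagged)) return(NA_real_)
  mean(actual[flagged] == 1)
}

# Made-up predictions and outcomes for eight persons in 2020.
prob_2020  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.3, 0.2, 0.1)
churn_2020 <- c(1,   0,   0,   0,   0,    0,   1,   0)

# Five persons are flagged and one of them churned.
precision_at(prob_2020, churn_2020)  # 0.2
```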
