Dear Team,
After screening with weight of evidence (WoE) and information value (IV), I am left with 8 of the roughly 40 candidate variables, all of which are highly or moderately significant.
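For readers unfamiliar with the screening step, here is a minimal sketch of how WoE and IV are commonly computed for one categorical predictor, assuming a binary 0/1 response. The function name woe_iv, the smoothing constant eps, and the simulated data are hypothetical and are not taken from the actual analysis.

```r
# Hypothetical helper: WoE per category and total IV for a factor x and a 0/1 response y.
woe_iv <- function(x, y, eps = 0.5) {
  tab  <- table(x, y)            # rows: categories, columns: "0" and "1"
  good <- tab[, "1"] + eps       # eps avoids log(0) when a category has an empty cell
  bad  <- tab[, "0"] + eps
  p_good <- good / sum(good)     # share of all events falling in each category
  p_bad  <- bad / sum(bad)       # share of all non-events falling in each category
  woe <- log(p_good / p_bad)
  iv  <- sum((p_good - p_bad) * woe)
  list(woe = woe, iv = iv)
}

# Simulated example: category "a" has a higher event rate than "b" and "c"
set.seed(1)
x <- factor(sample(c("a", "b", "c"), 500, replace = TRUE))
y <- rbinom(500, 1, ifelse(x == "a", 0.7, 0.3))

res <- woe_iv(x, y)
print(res$woe)
print(res$iv)
```

Each term of the IV sum is non-negative, so a variable with many small categories can accumulate a large IV from noise alone. That may be worth keeping in mind for a 60+ level factor.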
One of the independent variables is categorical and has 60+ categories. It is a very strong predictor, so please suggest how I should use it in the model.
When I add this variable to the model, the null deviance and AIC decrease and the other predictors turn insignificant.
In another model without this variable, the null deviance and AIC improve. What could be the reason? Is this variable collinear with some other predictor?
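One check that could be run here is sketched below on simulated data. The names dat, big_cat, x1 and y are hypothetical stand-ins for the real data. It fits the model with and without the large factor on the same rows and compares them with a likelihood-ratio test and AIC. The null deviance depends only on the response and the rows used, so if it differs between the two fits, the two models were probably estimated on different subsets of rows (for example, because of missing values in the factor).

```r
# Simulated stand-in data: a 20-level factor plus one numeric predictor
set.seed(42)
n <- 600
dat <- data.frame(
  big_cat = factor(sample(paste0("c", 1:20), n, replace = TRUE)),
  x1      = rnorm(n)
)
dat$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * dat$x1))

# Nested models fitted on exactly the same rows
m_without <- glm(y ~ x1,           family = binomial, data = dat)
m_with    <- glm(y ~ x1 + big_cat, family = binomial, data = dat)

# With identical rows, the null deviance must be the same in both fits
same_null <- isTRUE(all.equal(m_without$null.deviance, m_with$null.deviance))
print(same_null)

# Likelihood-ratio test for the added factor, and AIC side by side
lrt <- anova(m_without, m_with, test = "Chisq")
print(lrt)
print(AIC(m_without, m_with))
```

In the real data, very large standard errors on some levels would also be worth checking, since they usually mean a level has almost no variation in the response.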
Please see the syntax and output below (model without that categorical variable):
m1.logit <- ...
                                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                                    -1.5417     0.4281   -3.60 0.00032 ***
regionA                                         0.1445     0.3107    0.47 0.64182
regionE                                        -0.0384     0.2056   -0.19 0.85190
regionJ                                        -0.2955     0.2959   -1.00 0.31796
regionL                                        -0.9134     0.7891   -1.16 0.24703
regionUnknown                                  11.4219   509.6521    0.02 0.98212
know                                           -1.4286     0.2062   -6.93 0.0000000000042 ***
repS                                            4.1152     0.2051   20.06 < 0.0000000000000002 ***
und1                                           -0.2958     0.2126   -1.39 0.16398
case_statusClosed - Customer Closed             0.4417     0.4571    0.97 0.33386
case_statusClosed - Directed to IdeaExchange   -0.4448     0.8567   -0.52 0.60358
case_statusClosed - No response from customer  -0.9096     0.5767   -1.58 0.11477
case_statusClosed - Request out of Scope       -0.4657     0.4401   -1.06 0.28998
case_statusClosed - Resolved                    0.5707     0.3925    1.45 0.14599
case_statusWorking                             11.9926   624.1940    0.02 0.98467
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2553.5 on 2540 degrees of freedom
Residual deviance: 1287.7 on 2526 degrees of freedom
AIC: 1318
Number of Fisher Scoring iterations: 13
I also ran an ANOVA on the model to get the analysis of deviance table:
anova(m1.logit, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: survey
Terms added sequentially (first to last)
             Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                          2540       2554
region        5       13      2535       2540 0.022 *
know          1      507      2534       2033 < 0.0000000000000002 ***
repS          1      715      2533       1319 < 0.0000000000000002 ***
und           1        3      2532       1316 0.109
case_status   6       28      2526       1288 0.000078 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I am not sure how to use a factor variable with so many categories, so I request the forum to advise.