Dear Team,

After applying the weight of evidence (WoE) and information value (IV) approach, I am left with 8 of the roughly 40 candidate variables that are highly or moderately predictive.
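For readers unfamiliar with the screening step, here is a minimal sketch of how WoE and IV are commonly computed for one categorical predictor against a 0/1 response. The helper name `woe_iv` and the toy vectors are made up for illustration and are not from my data; note also that some texts define WoE with the ratio inverted, which flips the sign of WoE but not of IV.

```r
# Hypothetical helper: WoE and IV for one categorical predictor x
# against a binary 0/1 response y. Names are illustrative only.
woe_iv <- function(x, y) {
  tab  <- table(x, y)                     # rows: levels of x, cols: "0" and "1"
  good <- tab[, "1"] / sum(tab[, "1"])    # share of all events falling in each level
  bad  <- tab[, "0"] / sum(tab[, "0"])    # share of all non-events falling in each level
  woe  <- log(good / bad)                 # becomes infinite if a level has an empty cell
  iv   <- sum((good - bad) * woe)         # information value of the whole variable
  list(woe = woe, iv = iv)
}

# Toy data, made up for illustration
x <- c("a", "a", "a", "b", "b", "b", "b", "b")
y <- c(1, 1, 0, 1, 0, 0, 0, 0)

res <- woe_iv(x, y)
res$woe   # one WoE value per level
res$iv    # single IV for the variable
```

With 60+ categories, several levels are likely to have an empty cell, in which case this simple version returns infinite WoE values for them.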

One of the independent variables is categorical and has 60+ categories. It is a highly predictive variable, so please suggest how I should use it in the model.

When I add this variable to the model, the null deviance and AIC decrease, and the other predictors turn insignificant.

In another model without this variable, the null deviance and AIC improve. What could be the reason? Is this variable collinear with some other predictor?
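One thing that may be worth checking first: the null deviance of a binomial GLM depends only on the response, so it should change between the two fits only if they use different rows, for example because `glm` drops rows where the categorical variable is missing. A minimal sketch on simulated data (all names and values below are made up, not my real data):

```r
# Simulated data, made up for illustration
set.seed(1)
n <- 200
d <- data.frame(
  y  = rbinom(n, 1, 0.4),
  x1 = rnorm(n),
  f  = factor(sample(letters[1:5], n, replace = TRUE))
)
d$f[1:20] <- NA   # pretend the factor is missing for some rows

m_without <- glm(y ~ x1,     family = binomial, data = d)
m_with    <- glm(y ~ x1 + f, family = binomial, data = d)

# Different row counts mean the two null deviances are not comparable
c(without = nobs(m_without), with = nobs(m_with))
c(without = m_without$null.deviance, with = m_with$null.deviance)

# Refit both models on the same complete rows before comparing AIC
d_cc <- d[complete.cases(d), ]
m_without_cc <- glm(y ~ x1,     family = binomial, data = d_cc)
m_with_cc    <- glm(y ~ x1 + f, family = binomial, data = d_cc)
AIC(m_without_cc, m_with_cc)
```

If the row counts match in the real fits, the difference would have to come from something else, and this sketch would not explain it.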

Please see the syntax and output below (model without that categorical variable):

summary(m1.logit)

Coefficients:
                                               Estimate Std. Error z value             Pr(>|z|)
(Intercept)                                     -1.5417     0.4281   -3.60              0.00032 ***
regionA                                          0.1445     0.3107    0.47              0.64182
regionE                                         -0.0384     0.2056   -0.19              0.85190
regionJ                                         -0.2955     0.2959   -1.00              0.31796
regionL                                         -0.9134     0.7891   -1.16              0.24703
regionUnknown                                   11.4219   509.6521    0.02              0.98212
know                                            -1.4286     0.2062   -6.93      0.0000000000042 ***
repS                                             4.1152     0.2051   20.06 < 0.0000000000000002 ***
und1                                            -0.2958     0.2126   -1.39              0.16398
case_statusClosed - Customer Closed              0.4417     0.4571    0.97              0.33386
case_statusClosed - Directed to IdeaExchange    -0.4448     0.8567   -0.52              0.60358
case_statusClosed - No response from customer   -0.9096     0.5767   -1.58              0.11477
case_statusClosed - Request out of Scope        -0.4657     0.4401   -1.06              0.28998
case_statusClosed - Resolved                     0.5707     0.3925    1.45              0.14599
case_statusWorking                              11.9926   624.1940    0.02              0.98467
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2553.5 on 2540 degrees of freedom
Residual deviance: 1287.7 on 2526 degrees of freedom
AIC: 1318

Number of Fisher Scoring iterations: 13
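The very large standard errors for regionUnknown and case_statusWorking (about 510 and 624) are the pattern usually seen when a level contains only one outcome class, so the coefficient has no finite estimate. A minimal sketch of how that could be checked with a cross-tabulation, using made-up toy data rather than my real data set:

```r
# Toy data, made up for illustration: level "Unknown" only ever has y = 1
d <- data.frame(
  y      = c(0, 1, 0, 1, 0, 1, 0, 1, 1, 1),
  region = factor(c("A", "A", "A", "E", "E", "E", "E", "A",
                    "Unknown", "Unknown"))
)

# Cross-tabulate each level against the response
tab <- table(d$region, d$y)
tab

# Levels with an empty cell typically produce huge estimates and
# standard errors in a logistic regression
sparse <- rownames(tab)[apply(tab == 0, 1, any)]
sparse
```

On the real data, the same table would use `region` and `case_status` against `survey`.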

I also ran an ANOVA to examine the analysis of deviance table:

anova(m1.logit, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: survey

Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev             Pr(>Chi)
NULL                         2540       2554
region       5       13      2535       2540                0.022 *
know         1      507      2534       2033 < 0.0000000000000002 ***
repS         1      715      2533       1319 < 0.0000000000000002 ***
und          1        3      2532       1316                0.109
case_status  6       28      2526       1288             0.000078 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I am not sure how to use this factor variable with so many categories, so I request the forum's advice.
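One option I have seen suggested for high-cardinality factors is to lump sparse levels into a single "Other" level before fitting. A minimal sketch of that idea follows; the helper `lump_rare`, the threshold of 100 rows, and the simulated variable `cat60` are all assumptions made for illustration, and I do not know whether this is appropriate for my data.

```r
# Hypothetical helper: lump levels with fewer than min_n rows into one level
lump_rare <- function(f, min_n = 30, other = "Other") {
  f      <- as.character(f)
  counts <- table(f)
  rare   <- names(counts)[counts < min_n]   # levels too sparse to estimate well
  f[f %in% rare] <- other
  factor(f)
}

# Simulated 60-level variable, made up for illustration:
# 8 frequent levels and 52 infrequent ones
set.seed(42)
cat60 <- sample(paste0("c", 1:60), 2541, replace = TRUE,
                prob = c(rep(10, 8), rep(1, 52)))

nlevels(factor(cat60))                      # number of levels before lumping
cat_lumped <- lump_rare(cat60, min_n = 100)
nlevels(cat_lumped)                         # number of levels after lumping
table(cat_lumped)
```

The lumped factor could then go into `glm` in place of the original variable. Grouping by domain knowledge or replacing the levels with their WoE values are alternatives I would also like opinions on.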
