Logistic Regression Output In R for Categorical Predictor?


Dear Team,

After applying the weight of evidence and information value mechanism to the 40-odd variables, I am left with 8 variables that are highly or moderately significant.

One of the independent variables is categorical and has 60+ categories. It is a highly predictive variable, so please suggest how I should use it in the model.

When I add this variable to the model, my null deviance and AIC decrease, and the other predictors turn insignificant.

Then, in another model without this variable, my null deviance and AIC improve. What could be the reason? Is this variable collinear with some other predictor?

Please see the output below (model without that categorical variable):

m1.logit <- ...

Coefficients:
                                              Estimate Std. Error z value             Pr(>|z|)
(Intercept)                                    -1.5417     0.4281   -3.60              0.00032 ***
regionA                                         0.1445     0.3107    0.47              0.64182
regionE                                        -0.0384     0.2056   -0.19              0.85190
regionJ                                        -0.2955     0.2959   -1.00              0.31796
regionL                                        -0.9134     0.7891   -1.16              0.24703
regionUnknown                                  11.4219   509.6521    0.02              0.98212
know                                           -1.4286     0.2062   -6.93      0.0000000000042 ***
repS                                            4.1152     0.2051   20.06 < 0.0000000000000002 ***
und1                                           -0.2958     0.2126   -1.39              0.16398
case_statusClosed - Customer Closed             0.4417     0.4571    0.97              0.33386
case_statusClosed - Directed to IdeaExchange   -0.4448     0.8567   -0.52              0.60358
case_statusClosed - No response from customer  -0.9096     0.5767   -1.58              0.11477
case_statusClosed - Request out of Scope       -0.4657     0.4401   -1.06              0.28998
case_statusClosed - Resolved                    0.5707     0.3925    1.45              0.14599
case_statusWorking                             11.9926   624.1940    0.02              0.98467
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2553.5 on 2540 degrees of freedom
Residual deviance: 1287.7 on 2526 degrees of freedom
AIC: 1318

Number of Fisher Scoring iterations: 13

I also ran an ANOVA to examine the analysis of deviance table:

anova(m1.logit, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: survey

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev             Pr(>Chi)
NULL                          2540       2554
region        5       13      2535       2540                0.022 *
know          1      507      2534       2033 < 0.0000000000000002 ***
repS          1      715      2533       1319 < 0.0000000000000002 ***
und           1        3      2532       1316                0.109
case_status   6       28      2526       1288             0.000078 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I am not sure how to use this factor variable with so many categories, so I request the forum's advice.

Stephen Politzer-Ahles

First of all, AIC is a measure of deviance: lower AIC means a better model. So it is totally normal (and guaranteed) that adding a new variable to the model will make AIC smaller. Saying the AIC "improves" when you remove the variable is not correct.

It is also totally normal for other variables to become non-significant when you add a new variable into the model. It means that the new variable explains the variance that the other variables had been trying to explain, so once you include the new variable then there's not much variance remaining which is uniquely explained by the other variables.

Long story short, since it sounds like your model is behaving in a completely normal way, I don't really see what the problem here is.
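To illustrate the point about other predictors losing significance, here is a minimal simulated sketch (not the poster's data). The variables `g`, `x`, and `y` are made up for the example: `g` is a factor that truly drives the outcome, and `x` is mostly a noisy proxy for `g`. On its own `x` looks strongly significant; once `g` is in the model there is little left for `x` to explain.

```r
# Hypothetical simulated data: g drives y, x is a noisy proxy for g
set.seed(123)
n <- 1000
g <- factor(sample(letters[1:8], n, replace = TRUE))
effect <- seq(-1.5, 1.5, length.out = 8)         # assumed per-level effect on the logit
x <- effect[as.integer(g)] + rnorm(n, sd = 0.3)  # x carries almost the same information as g
y <- rbinom(n, 1, plogis(effect[as.integer(g)])) # y depends on g only

m_without <- glm(y ~ x,     family = binomial)   # model without the factor
m_with    <- glm(y ~ x + g, family = binomial)   # model with the factor

# Wald p-value of x in each model
p_without <- coef(summary(m_without))["x", "Pr(>|z|)"]
p_with    <- coef(summary(m_with))["x", "Pr(>|z|)"]
print(c(without_g = p_without, with_g = p_with))
```

In a setup like this the p-value for `x` is typically tiny without `g` and unremarkable with it, which is the ordinary behaviour described above rather than a sign that the model is broken.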

Shivi Bhatia

Thanks, Stephen.

The categorical variable with 66 categories is called support category. Would it work to run a bivariate analysis of the dependent variable against this categorical variable and then remove the categories where the number of surveys received is very low?

The analysis shows that approximately 89% of the responses come from only 15 categories.
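One possible way to act on that, sketched here with simulated data, is to pool the low-count categories into a single "Other" level rather than dropping those rows. This is only one option and a swapped-in technique, not something proposed in the thread. The names `support_category`, `support_lumped`, and the 0.89 cutoff are assumptions chosen to mirror the numbers above.

```r
# Hypothetical stand-in for a 66-level factor with a long tail of rare levels
set.seed(42)
n    <- 2000
lev  <- sprintf("cat%02d", 1:66)
prob <- 1 / (1:66)^1.5                       # assumed, very uneven level frequencies
support_category <- factor(sample(lev, n, replace = TRUE, prob = prob),
                           levels = lev)

# Count responses per level, largest first, and find the levels that
# together cover at least 89% of the responses
counts    <- sort(table(support_category), decreasing = TRUE)
cum_share <- cumsum(counts) / sum(counts)
keep      <- names(counts)[seq_len(which(cum_share >= 0.89)[1])]

# Recode every other level to a single pooled level
lumped <- as.character(support_category)
lumped[!(lumped %in% keep)] <- "Other"
support_lumped <- factor(lumped, levels = c(keep, "Other"))

print(table(support_lumped))
```

The pooled factor can then go into `glm()` in place of the original one. Whether pooling is appropriate depends on whether the rare categories behave alike with respect to the response, which the bivariate analysis mentioned above could help check.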

Salvatore S. Mangiafico

@Stephen Politzer-Ahles, one correction. AIC is not guaranteed to decrease if you add another term to the model. Criteria like AIC, AICc, and BIC balance the explanatory power of a model against the number of terms. So in the case of adding terms to a multiple regression, the r-squared will continue to increase, but the AIC should reach a minimum at some point and then increase again as more terms are added.

Attached is a plot of AICc vs. more-full models. (Building the "full model" from http://rcompanion.org/rcompanion/e_05.html )
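The same behaviour can be reproduced with a small simulation (made-up data, not the data behind the attached plot). The sketch below adds ten pure-noise predictors, hypothetically named `z1` to `z10`, one at a time to a logistic model. The residual deviance can only go down as terms are added, while AIC adds a penalty of 2 per parameter and so usually goes up when a term contributes nothing.

```r
# Hypothetical simulated data: one real predictor x, ten noise predictors z1..z10
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
noise <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(noise) <- paste0("z", 1:10)
dat <- cbind(data.frame(y = y, x = x), noise)

# Fit the nested sequence y ~ x, y ~ x + z1, ..., y ~ x + z1 + ... + z10
fits <- lapply(0:10, function(k) {
  rhs <- c("x", names(noise)[seq_len(k)])
  glm(reformulate(rhs, response = "y"), family = binomial, data = dat)
})

res <- data.frame(
  n_noise  = 0:10,
  deviance = sapply(fits, deviance),
  AIC      = sapply(fits, AIC)
)
print(res)
```

For a 0/1 response like this one, AIC is the residual deviance plus twice the number of estimated coefficients, which is also consistent with the output in the question (1287.7 + 2 × 15 ≈ 1318).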

Salvatore S. Mangiafico

@Shivi Bhatia, --- not really addressing your questions ---

note that you are using sequential sum-of-squares tests for your anova.

You may want to use Type II sum-of-squares tests:

library(car)

Anova(m1.logit, type="II", test="Wald")

### or test="LR"

Unless you want sequential tests.

(More information on Type I, II, III sum-of-squares: http://rcompanion.org/rcompanion/d_04.html )

For another example that uses lsmeans for post-hoc, see http://rcompanion.org/handbook/H_08.html#_Toc459550754
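To show the sequential versus marginal distinction without extra packages, here is a base-R sketch on simulated data (the variables `x1`, `x2`, and `y` are invented for the example). `anova()` on a `glm` tests terms in the order they were entered, so the result for a term depends on its position; `drop1()` tests each term given all the others. As far as I understand, for a main-effects model like this one the `drop1()` likelihood-ratio tests should agree with Type II tests using `test="LR"`, but treat that as an assumption to verify against `car::Anova`.

```r
# Hypothetical simulated data with two correlated predictors
set.seed(7)
n  <- 400
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)        # x2 is correlated with x1
y  <- rbinom(n, 1, plogis(0.3 + 1.0 * x2)) # only x2 truly affects y
dat <- data.frame(y, x1, x2)

fit_a <- glm(y ~ x1 + x2, family = binomial, data = dat)
fit_b <- glm(y ~ x2 + x1, family = binomial, data = dat)  # same model, other order

# Sequential tests: each term is tested after the terms listed before it
seq_a <- anova(fit_a, test = "Chisq")
seq_b <- anova(fit_b, test = "Chisq")
print(seq_a)
print(seq_b)

# Marginal likelihood-ratio tests: each term is tested given all the others
mar_a <- drop1(fit_a, test = "Chisq")
mar_b <- drop1(fit_b, test = "Chisq")
print(mar_a)
print(mar_b)
```

The sequential p-value for `x1` changes with the order of the terms, while the `drop1()` results are the same for both fits.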
