Can anyone help me with dealing with perfect separation in logistic regression in R?

03 March 2014

I'm trying to run a multivariable logistic regression model in R (backward selection), where a lot of my variables are dichotomous (e.g. diabetes = yes/no). An issue for my data is that I have 34 patients and would want to test approximately 25 variables (there are issues with this if I'm not mistaken). Alternatively, I could reduce the number of variables I want to test by performing a pre-selection of variables for the multivariable model based on univariable analysis, but that'll also lead to some bias. Or I could also just not run a multivariable model, because sample numbers are too small, but where's the fun in that?

When running the model in R (especially when computing the confidence intervals), I get the following warning:

glm.fit: fitted probabilities numerically 0 or 1 occurred
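For readers who want to see where this warning comes from, here is a minimal sketch that provokes it on purpose. The data frame `toy` is made up for illustration (it is not the poster's data): every patient with `x` above 4 has the outcome and nobody below does, so `x` separates the outcome perfectly.

```r
# Toy data (hypothetical): x perfectly separates y
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Fit an ordinary logistic regression and collect any warnings it raises
warn_msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial, data = toy),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warn_msgs)           # the glm.fit warning(s) raised during fitting
print(coef(summary(fit)))  # note the very large estimate and standard error for x
```

The huge coefficient and standard error are the usual fingerprint of separation: the likelihood keeps improving as the slope grows, so the fitting routine never finds a finite maximum.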

which led me to this discussion:

http://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression

I've now applied Firth's correction to the regression model (using the logistf package), and the output comes without warnings; however, it's completely different from my original model. Further, I'm not completely sure this is what my data needs. I also ran bayesglm() from the arm package, and here too, the stats look completely different.
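A rough sketch of that three-way comparison, on made-up data rather than the poster's. It assumes the `logistf` and `arm` packages may or may not be installed, so each penalised fit is attempted only if its package is available; `toy`, `ml` and `fits` are illustrative names.

```r
# Toy data (hypothetical, not the poster's): x perfectly separates y
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Plain maximum likelihood: under separation the slope drifts towards infinity
ml <- suppressWarnings(glm(y ~ x, family = binomial, data = toy))
print(coef(ml))

# Penalised fits, run only if the packages happen to be installed
fits <- list()
if (requireNamespace("logistf", quietly = TRUE)) {
  # Firth's penalised likelihood
  fits$firth <- logistf::logistf(y ~ x, data = toy)$coefficients
}
if (requireNamespace("arm", quietly = TRUE)) {
  # Bayesian logistic regression with arm's default weakly informative prior
  fits$bayes <- coef(arm::bayesglm(y ~ x, family = binomial, data = toy))
}
print(fits)
```

Under separation the plain `glm()` estimates are not really estimates at all, just the point where the algorithm stopped, so it is expected that the penalised coefficients look very different from them.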

I'm pretty new to stats, so any explanations on how to deal with this kind of problem would probably have to be formulated as simply as possible. In other words, I'm not too strong in math and think I might have gotten myself up shi**s creek with this whole issue.

Regardless of that last comment, if anyone has any ideas or feedback, that would be great.

In reference to this discussion http://andrewgelman.com/2011/05/04/whassup_with_gl/

it may also simply be a "bug" in the glm() function???

Berlinda Verdoodt

You may simply have too many variables for the number of patients. For dichotomous variables, it should be possible to find a combination of at most 6 parameters that describes your population perfectly (2^6 combinations of yes/no are possible) with a total population size of 34. Fewer if some of the patients are identical for the selected parameters.

So, either get many more patients, or build a hypothesis about which combination of parameters makes biological sense. Or throw out those that are highly correlated to other parameters of your dataset.

Oliver Maximilian Fisher

Hi Berlinda,

Thank you very much for your reply. This is what I was thinking too, namely that I was "overloading" the model. I did exactly what you recommended. Empirically it was interesting too: as soon as the model went over 6 or 7 explanatory variables, R started giving warnings.

Out of interest, which mathematical principle does the maximum of 6 variables stem from, and/or do you have a paper I could cite for this? As you know, medical reviewers will sometimes only "believe" statistical limitations if "proof" in the form of a citation is provided......

Thanks again for your input!

Kind regards,

Oliver

Berlinda Verdoodt

Hi.

The 6 parameters was just a calculation of how many different combinations of yes/no (or 0/1) can be generated with n binary variables. This is equal to 2^n. For 6 parameters this is 2^6 = 64, the lowest power of 2 greater than 34. So, not an official statistical theorem, just a basic model of why this goes wrong if those parameters are independent of each other. If they're highly correlated, of course, using them all would not lead to perfect classification, but each new parameter would hardly add any information to the model, so that wouldn't really help either.
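The arithmetic behind that number can be checked in a couple of lines. This is only the counting argument described above, not a statistical rule; the variable names are illustrative.

```r
n_patients <- 34            # sample size from the question
n_binary   <- 1:8           # number of yes/no variables considered
n_patterns <- 2^n_binary    # distinct yes/no combinations they can form

# Table of how quickly the number of combinations outgrows the sample
print(data.frame(n_binary, n_patterns, exceeds_sample = n_patterns > n_patients))

# Smallest number of binary variables with more combinations than patients
min_vars <- min(n_binary[n_patterns > n_patients])
print(min_vars)  # 6, because 2^5 = 32 < 34 < 64 = 2^6
```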

Dan Château

wow.. yeah. that's way too many variables. In fact, it's not the total number of cases in a logistic regression that's important, but rather the number of events for the outcome (or non-events if they are fewer).
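A quick way to put a number on this is events per variable (EPV): the smaller of the event and non-event counts, divided by the number of candidate predictors. In the sketch below the 34 patients and 25 variables come from the question, but the split into 12 events and 22 non-events is invented purely for illustration.

```r
# Hypothetical outcome vector: 12 events and 22 non-events (made-up split)
outcome <- c(rep(1, 12), rep(0, 22))
n_candidates <- 25  # candidate predictors, as stated in the question

# The limiting count is whichever outcome category is rarer
limiting <- min(sum(outcome == 1), sum(outcome == 0))
epv <- limiting / n_candidates

print(limiting)  # 12
print(epv)       # 0.48
```

For comparison, the conventional rule of thumb asks for roughly 10 events per variable, so a value below 1 is a long way short whatever the true event count turns out to be.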

Vittinghoff and McCulloch have a nice simulation of the bias and precision issues when you try to put too many variables into a logistic regression with too few events.

http://aje.oxfordjournals.org/content/165/6/710.full

Oliver Maximilian Fisher

Thanks both Berlinda and Dan. Great input!

Cheers,

Ivan Kshnyasev

Seems to me, any predictor that provides linear separation is the best, out of competition! LogLik=0, AIC=2k, w=1.
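Those numbers can be reproduced on a small made-up example. The data frame `toy` is hypothetical, with `x` separating `y` perfectly; `k` is the number of estimated coefficients.

```r
# Toy data (hypothetical): x perfectly separates y
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Warnings are suppressed here because separation is intentional
fit <- suppressWarnings(glm(y ~ x, family = binomial, data = toy))

k  <- length(coef(fit))        # intercept + slope = 2
ll <- as.numeric(logLik(fit))  # log-likelihood, essentially 0
print(c(logLik = ll, AIC = AIC(fit), two_k = 2 * k))
```

The log-likelihood only approaches 0 as the slope grows without bound, so the fitted coefficient and its standard error are not usable as effect estimates even though the AIC looks unbeatable.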

https://www.researchgate.net/profile/Ivan_Kshnyasev/contributions/?ev=prf_act
