I have a single response, the ratio of faulty products to total products (a success or failure ratio), observed over a fairly large number of factors (~15). Since these 15 factors are not independent of each other, the number of successes is close to zero in almost half of the cells (combinations of factors). When I fit a logit regression, not surprisingly, I find that the expected number of successes in those cells is smaller than five. It is generally known that the deviance or Pearson X^2 is approximately chi-square distributed only if E{X} >= 5 (though this is a conservative rule). However, the expected values in quite a number of my cells are smaller than unity! Can anybody recommend what to do in this case? As far as I can see, there are two ways to deal with the problem:

1. Construct the model with all the data, find the combinations of factors for which E{X} < 5 (or maybe 3), omit those cells and refit the model. However, I would then not be using some of the data points in the refit.

2. Construct the model with all the data, but in the deviance calculation use only the cells for which E{X} > 3-5, and for the degrees of freedom use DOF = (number of remaining cells) - (number of parameters). The problem here is that I'm not sure whether this is statistically valid, since I would be systematically discarding a portion of the data in the deviance calculation, which may introduce bias.
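
To make the arithmetic of method 2 concrete, here is a minimal sketch in plain Python. It assumes the fitted success probabilities per cell are already available from the logit model fitted on all the data. The function name, argument names, and the small example numbers are hypothetical and only for illustration. It shows the calculation only and does not settle whether the restricted statistic is valid.

```python
import math

def restricted_fit_stats(successes, totals, fitted_probs, n_params, min_expected=5.0):
    """Pearson X^2 and deviance over the cells whose expected number of
    successes n*p is at least min_expected (method 2 above).

    Assumes 0 < p < 1 for every fitted probability.
    """
    x2 = 0.0
    dev = 0.0
    kept = 0
    for y, n, p in zip(successes, totals, fitted_probs):
        mu = n * p                      # expected successes in this cell
        if mu < min_expected:
            continue                    # sparse cell: left out of the statistics
        kept += 1
        x2 += (y - mu) ** 2 / (mu * (1.0 - p))
        # Binomial deviance terms, using the convention 0 * log(0) = 0
        if y > 0:
            dev += 2.0 * y * math.log(y / mu)
        if y < n:
            dev += 2.0 * (n - y) * math.log((n - y) / (n - mu))
    dof = kept - n_params               # DOF = remaining cells - number of parameters
    return x2, dev, dof, kept

# Made-up example: four cells, two of them with expected successes below 5
successes = [12, 0, 8, 1]
totals = [100, 50, 80, 40]
fitted_probs = [0.10, 0.01, 0.10, 0.02]   # hypothetical fitted values
x2, dev, dof, kept = restricted_fit_stats(successes, totals, fitted_probs, n_params=1)
print(kept, dof, x2, dev)
```

Method 1 would use the same `n * p < min_expected` test to decide which cells to drop before refitting, instead of only skipping them in the sums.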

Any help would be appreciated.
