I have a single response: the proportion of faulty products out of total products (a success or failure ratio), observed over a fairly large number of factors (~15). Since these 15 factors are not independent of each other, the number of successes is close to zero in almost half of the cells (combinations of factors). When I fit a logit regression, not surprisingly, the expected numbers of successes in those cells are smaller than five. It is generally accepted that the deviance and the Pearson X^2 statistic are approximately chi-square distributed only if E{X} >= 5 in each cell (though this is a conservative rule of thumb). However, the expected values in quite a number of my cells are smaller than one! Can anybody recommend what to do in this case? As far as I can see, there seem to be two ways to deal with the problem:

1. Fit the model with all the data, find the combinations of factors for which E{X} < 5 (or maybe 3), omit those cells, and refit. The drawback is that some of the data points would not be used in the refit.

2. Fit the model with all the data, but compute the deviance using only the cells for which E{X} exceeds the threshold (somewhere between 3 and 5). For the degrees of freedom, use DOF = remaining cells - number of parameters. The problem here is that I'm not sure whether this is statistically valid: I would be systematically discarding part of the data in the deviance calculation, which may introduce bias.
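
To make option 2 concrete, here is a minimal Python sketch of the restricted calculation. It assumes the fitted probabilities have already been obtained from the logit model on all the data; the function name `restricted_gof`, its arguments, and the example numbers are all made up for illustration. It only shows the mechanics of the calculation and says nothing about whether the resulting statistic is valid.

```python
import math


def restricted_gof(successes, trials, fitted_probs, n_params, min_expected=5.0):
    """Pearson X^2 and deviance over cells with expected successes >= min_expected.

    Hypothetical helper: fitted_probs are assumed to come from a logit model
    fitted elsewhere, and to lie strictly between 0 and 1.
    """
    pearson = 0.0
    deviance = 0.0
    kept = 0
    for y, n, p in zip(successes, trials, fitted_probs):
        expected = n * p  # expected number of successes in this cell
        if expected < min_expected:
            continue  # sparse cell: left out of the statistics, as in option 2
        kept += 1
        # Pearson contribution for a binomial cell
        pearson += (y - expected) ** 2 / (expected * (1.0 - p))
        # Deviance contribution; a term with a zero count contributes 0
        if y > 0:
            deviance += 2.0 * y * math.log(y / expected)
        if n - y > 0:
            deviance += 2.0 * (n - y) * math.log((n - y) / (n - expected))
    # DOF = remaining cells - number of parameters, as proposed above
    dof = kept - n_params
    return {"cells_used": kept, "dof": dof, "pearson": pearson, "deviance": deviance}


# Made-up example: four cells, two of them with expected successes below 5
result = restricted_gof(
    successes=[0, 1, 12, 30],
    trials=[50, 40, 100, 120],
    fitted_probs=[0.01, 0.02, 0.10, 0.25],
    n_params=1,
)
print(result)
```

In this toy example the first two cells have expected successes of 0.5 and 0.8, so only the last two cells enter the statistics.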

Any help would be appreciated.
