Would it be better not to enter into the logistic regression model those variables that have an extremely unbalanced distribution across the 2 groups, even if they are statistically significant (p
There's a procedure for handling such things called Firth regression. I suggest you try it; it's available in many statistics packages. Best wishes, David Booth
That's not the point. The point is whether your data provide sufficient information regarding your model. On average, data from a Bernoulli variable with p close to 0 or close to 1 provide only very little information, but since every observation adds some information, it's eventually a matter of the sample size. So if your sample size is huge, you may have a good chance to see something interesting. But if your sample size is moderate or even small, you just won't see anything.
To be clear, for LR this is not only a matter of having a "large" sample; the issue of "rare events" and how it manifests in MLE concerns the absolute size of the smallest group, not its relative size. That is, you will have the same problem with a 100/10 split and a 10000/10 split. This is why various rules-of-thumb regarding N for LR are stated in terms of the size of the smallest group relative to the number of predictors. Obviously, I have no knowledge of the data, but the applicability of this depends on what is meant by "unbalanced."
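To put rough numbers on the 100/10 versus 10000/10 comparison, here is a minimal sketch. It assumes the simplest possible case, an intercept-only logistic model, where the large-sample standard error of the estimated log-odds is sqrt(1/events + 1/non-events). The function name is my own, and the counts are just the two splits mentioned above.

```python
import math

def log_odds_se(events, non_events):
    """Large-sample SE of the log-odds in an intercept-only logistic model."""
    return math.sqrt(1.0 / events + 1.0 / non_events)

# Same 10 events, very different numbers of non-events.
se_small = log_odds_se(10, 100)
se_large = log_odds_se(10, 10000)

print(f"100/10 split:   SE = {se_small:.4f}")
print(f"10000/10 split: SE = {se_large:.4f}")
```

Under this assumption the 1/events term dominates, so adding a hundred times more non-events barely changes the standard error. With covariates in the model the formula is more involved, but the same qualitative point applies.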
Firth's clever work, although often simply used to get LR estimates when they would otherwise be intractable with MLE (complete or quasi-complete separation), is a general attempt to penalize MLE weights with respect to the "smallness" of the sample, with those penalized weights converging to the MLE weights as sample size increases. As it has a Bayesian basis, Firth's penalization of MLE can be seen as conceptually parallel to the penalization of OLS in ridge or LASSO regression estimation.
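For anyone who wants to see the mechanics, here is a rough sketch of Firth's modified-score iteration for the smallest interesting case: an intercept plus one binary predictor, with the data given as grouped counts. The function name and the counts are hypothetical (they are not the poster's data), and in practice you would use your statistics package's Firth option rather than hand-rolled code like this. The sketch relies on the standard form of the modified score, sum over observations of x * (y - p + h * (0.5 - p)), where h is the hat-matrix diagonal.

```python
import math

def firth_binary_predictor(groups, n_iter=100, tol=1e-10):
    """Firth-penalized logistic fit for an intercept and one 0/1 predictor.

    groups: list of (x, events, total) tuples with x in {0, 1}.
    Returns (b0, b1) on the log-odds scale.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Fisher information I = sum over groups of total * w * [[1, x], [x, x^2]]
        i00 = i01 = i11 = 0.0
        fitted = []
        for x, events, total in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            i00 += total * w
            i01 += total * w * x
            i11 += total * w * x * x
            fitted.append((x, events, total, p, w))
        # Inverse of the symmetric 2x2 information matrix
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det
        # Modified score: the usual residual plus the Firth correction term
        u0 = u1 = 0.0
        for x, events, total, p, w in fitted:
            h = w * (v00 + 2.0 * v01 * x + v11 * x * x)  # hat value per observation
            resid = (events - total * p) + total * h * (0.5 - p)
            u0 += resid
            u1 += resid * x
        step0 = v00 * u0 + v01 * u1
        step1 = v01 * u0 + v11 * u1
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1

# Hypothetical counts with a zero cell: everyone with x = 1 has the event.
b0, b1 = firth_binary_predictor([(0, 25, 55), (1, 3, 3)])
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

With ordinary MLE the slope for data like these has no finite estimate; the penalized fit returns a finite one. In this particular two-group case the result coincides with simply adding 0.5 to every cell of the 2x2 table, which is a handy sanity check.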
Hello Valentina Ferro. I thought you were asking about imbalance on explanatory variables. Did I understand that correctly? Is there a lot of imbalance on the outcome variable too? Perhaps you could show us the 2x2 table for the specific example you mentioned in the original question. That would help me (at least) to understand your question better. Thanks.
Bruce Weaver, I meant an unbalanced frequency distribution of the values of the dependent variable, given the values of the independent variable. I attached the table. The dependent variable is dichotomous: Group period 1 and Group period 2. For independent variables 15 and 21, the frequency in both cases is 0 for Group period 1 and 10.71% for Group period 2. I was wondering whether it is correct to enter these two (statistically significant) variables into the logistic regression model. I'm learning medical statistics, and someone more experienced than me suggested that I not include them.
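A small sketch of why a zero cell like this one is awkward for ordinary maximum likelihood. The counts below are hypothetical, since the attached table is not reproduced in the thread: 3 of 28 is used only because it matches the quoted 10.71%, and the size of Group period 1 (30) is invented. The outcome is coded as "Group period 2", and x = 1 means the independent variable is present.

```python
import math

def log_lik(b0, b1, table):
    """Logistic log-likelihood for grouped data.

    table maps x (0 or 1) to (events, non_events).
    """
    total = 0.0
    for x, (events, non_events) in table.items():
        eta = b0 + b1 * x
        # log(p) = -log(1 + exp(-eta)); log(1 - p) = -log(1 + exp(eta))
        total -= events * math.log1p(math.exp(-eta))
        total -= non_events * math.log1p(math.exp(eta))
    return total

# Hypothetical counts: nobody with x = 1 is in Group period 1 (the zero cell).
table = {0: (25, 30), 1: (3, 0)}

b0 = math.log(25 / 30)  # intercept fixed at the observed log-odds for x = 0
slopes = [1.0, 5.0, 10.0, 20.0]
log_liks = [log_lik(b0, b1, table) for b1 in slopes]

for b1, ll in zip(slopes, log_liks):
    print(f"b1 = {b1:5.1f}  log-likelihood = {ll:.6f}")
```

The log-likelihood keeps creeping upward as the slope grows, so there is no finite maximum-likelihood estimate for that coefficient (quasi-complete separation). Software will typically report a huge coefficient with an enormous standard error, or a convergence warning, which is the situation the penalized approach mentioned earlier in the thread is meant to address.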