Hi, I've built a logistic regression model to find predictors of having a performance issue in the workplace. Staff are coded as either 1 (they have a performance issue) or 0 (they do not).
Approximately 50 out of 3000 staff have a performance issue. Before running the analysis I excluded some groups from the dataset (particular staff grade levels and divisions) because they have zero individuals in the performance issue group. When those groups were included, the standard errors for their coefficients came out very large, which seemed to indicate a problem.
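As far as I understand it, the very large standard errors are the usual symptom of (quasi-)complete separation. A rough sketch of why, in my own notation (the symbols $\beta_g$ for the dummy coefficient of a zero-event group $g$ and $\eta_i$ for the rest of the linear predictor are assumptions for illustration, not taken from my software output): since every $y_i = 0$ in group $g$, the group's contribution to the log-likelihood is

$$
\ell_g(\beta_g) = \sum_{i \in g} \log\bigl(1 - p_i\bigr), \qquad p_i = \frac{1}{1 + e^{-(\eta_i + \beta_g)}} .
$$

Each term increases towards 0 as $\beta_g \to -\infty$, and $\beta_g$ appears nowhere else in the likelihood, so there is no finite maximum. The fitting routine just stops at some large negative value, where the likelihood is almost flat, and that flatness is what shows up as a huge standard error.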
Should I also exclude groups which have only 1, 2 or 3 occurrences, and so on? I wonder whether they lead to less accurate estimates of the other coefficients. Whether I exclude them affects which characteristics come up as statistically significant: in one model age is significant, in another disability, in another work location. The p-values move from just above 0.05 to just below.
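To make concrete what I mean by groups with 0, 1, 2 or 3 occurrences, here is a minimal sketch of how the counts per group level can be tabulated. The toy records, the column names `grade` and `issue`, and the helper names `events_per_level` and `sparse_levels` are all made up for illustration; this is not my real data or code.

```python
from collections import defaultdict


def events_per_level(rows, group_key, outcome_key="issue"):
    """Return {level: (n_events, n_total)} for one categorical column."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        level = row[group_key]
        counts[level][0] += row[outcome_key]  # outcome is coded 0/1
        counts[level][1] += 1
    return {level: tuple(c) for level, c in counts.items()}


def sparse_levels(table, min_events=1):
    """Levels with fewer than `min_events` staff in the issue group."""
    return sorted(
        level for level, (events, _total) in table.items() if events < min_events
    )


# Hypothetical toy data: grade A has no events, B has one, C has two.
staff = [
    {"grade": "A", "issue": 0},
    {"grade": "A", "issue": 0},
    {"grade": "A", "issue": 0},
    {"grade": "B", "issue": 0},
    {"grade": "B", "issue": 1},
    {"grade": "B", "issue": 0},
    {"grade": "B", "issue": 0},
    {"grade": "C", "issue": 1},
    {"grade": "C", "issue": 1},
    {"grade": "C", "issue": 0},
]

table = events_per_level(staff, "grade")
zero_event = sparse_levels(table, min_events=1)  # groups I already dropped
few_event = sparse_levels(table, min_events=2)   # groups I am unsure about
```

With `min_events=1` this flags only the zero-event levels that I have already excluded; raising the threshold to 2, 3 or 4 flags the additional levels my question is about.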
Is there a relatively easy way to justify which groups to include/exclude?
Thanks
Rob