What is a good way to judge which groups to exclude from logistic regression analysis?

How are you including groups in the model?

In general I would not exclude any, but I would do a with-and-without analysis. If you do a random effects multilevel analysis of individuals in groups, imbalance is expected and modelled, and unreliable results based on small numbers within a group are downplayed. If you have very few groups I would do a fixed effects analysis with dummies. I would not do a pooled analysis ignoring groups.

Fundamentally, those groups without performance issues are key, not aberrant!

If you do not understand these phrases (fixed, random, imbalance etc.) then just ask.

Rob Green

Thanks Kelvyn Jones. I started by including all staff; then, when I wanted to include grade in the regression, I excluded two grades in which nobody had performance issues. I thought I had to do that to get a good-quality model. When I do include them, the standard errors for these grades are around 6,000, compared with up to 1 for the others. The change in significance and odds ratios for the other variables is negligible, though. When I wanted to include work division as a factor, I also excluded staff in two groups where no staff had performance issues.

I think you are saying that it is best to keep in the groups with low levels of performance issues. I'm not sure what you are saying about fixed effects analysis.
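Standard errors in the thousands are what one would expect when a category contains no events at all (often called quasi-complete separation). The following is a minimal sketch in Python, using made-up counts rather than the poster's data, and the function names are hypothetical. For a category with `n` staff and zero events, the log-likelihood keeps rising as the log-odds `theta` decreases, so there is no finite maximum, and the Wald standard error `1 / sqrt(n * p * (1 - p))` grows without bound.

```python
import math

def loglik_zero_events(theta, n):
    """Log-likelihood contribution of a category with n individuals and no events."""
    # Each individual contributes log(1 - p) = -log(1 + exp(theta)).
    return -n * math.log1p(math.exp(theta))

def wald_se(theta, n):
    """Wald standard error of the category's log-odds evaluated at theta."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return 1.0 / math.sqrt(n * p * (1.0 - p))

n = 10  # hypothetical category size
thetas = [-2.0, -5.0, -10.0, -20.0]
logliks = [loglik_zero_events(t, n) for t in thetas]
ses = [wald_se(t, n) for t in thetas]

for t, ll, se in zip(thetas, logliks, ses):
    print(f"theta={t:6.1f}  loglik={ll:.3e}  se={se:.1f}")
```

The fitting routine stops at some large negative coefficient, and the reported standard error simply reflects where it stopped. That is consistent with the other coefficients barely changing when these categories are included.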

@Yanping Wang thanks but I'm not sure what you mean.

Bruce Weaver

Hi Rob Green. Is having a performance issue really a dichotomous variable? Is it possibly a count variable? Is there any way to measure the severity or seriousness of the performance issues? Just curious!

Kelvyn Jones

Going back to your original post; what do you mean by groups?

This article is from a long time ago, but it may help in seeing where I am coming from: Does organization matter? A multilevel analysis of the deman...

Fixed effects modelling is putting dummies in the model to represent 'groups'.
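To make the dummy-variable idea concrete, here is a minimal sketch in Python. The function name, the `Grade` prefix and the example values are hypothetical. One level is left out as the reference category, and every other level gets its own 0/1 column.

```python
def dummy_code(values, reference, prefix="Grade"):
    """Return one dict of 0/1 indicators per value, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return [{f"{prefix}{lvl}": int(v == lvl) for lvl in levels} for v in values]

grades = ["A", "B", "C", "A", "B"]  # hypothetical data
rows = dummy_code(grades, reference="A")

for grade, row in zip(grades, rows):
    print(grade, row)
```

Each coefficient on a dummy is then a log-odds ratio relative to the reference level, so the choice of reference affects interpretation but not the fit.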

Rob Green

Bruce Weaver, there are different categories of performance issue and I have 9 models for the different types. Individuals do have multiple issues/occurrences and I have counted them in the data. I was assuming that modelling as 0/1 would be reasonable, though, given that the vast majority are 0.

Rob Green

@Kelvyn Jones my data is at individual level. By groups I mean multiple individuals (rows) which share a common variable/characteristic, e.g. Grade=A.

Rob Green

My independent variables are dummies, e.g. GradeA=0/1, Disabled=0/1 etc.

Kelvyn Jones

So they are not functioning organizational groups, just categorical variables or factors. I was confused by nonstandard language.

Your real problem is a combination of two things: a relatively rare outcome (look up Firth regression) and classifications that are over-detailed, across multiple categorical variables, for the data you have. Presumably you have few individuals (irrespective of performance issues) in certain categories. Can you create fewer meaningful categories? Cross-tabs of the predictors may help, but fundamentally the combined categories have to make theoretical sense. Put simply, if you have low, mid and high categories on a variable, I would not recode high and low together into a new code just to get the numbers up. Nor would I choose the grouping on the basis of performance. This is akin to the problem of combining categories in a chi-square analysis when the expected counts are small in absolute terms. So combine, not drop, if you can. Otherwise the results may be biased: you can imagine a category with a large number of workers but no poor performances; dropping them would lose valuable information.
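The "combine, not drop" advice can be sketched in Python as follows. All counts, grade labels and function names are hypothetical. The sketch cross-tabulates a predictor against the outcome, merges two adjacent grades (chosen by their position in the grade scale, not by their outcome rates), and shows the Firth-type estimate for the special case of a single categorical predictor, where it reduces to adding 0.5 to the event and non-event counts of each category. With several predictors there is no such closed form, and an iterative fit in statistical software is needed.

```python
import math
from collections import Counter

# Hypothetical (grade, has_issue) records: A 4/40, B 3/30, C 1/8, D 0/5.
records = (
    [("A", 1)] * 4 + [("A", 0)] * 36
    + [("B", 1)] * 3 + [("B", 0)] * 27
    + [("C", 1)] * 1 + [("C", 0)] * 7
    + [("D", 0)] * 5
)

def crosstab(records):
    """Map each level to (events, non_events)."""
    events, totals = Counter(), Counter()
    for level, y in records:
        totals[level] += 1
        events[level] += y
    return {lvl: (events[lvl], totals[lvl] - events[lvl]) for lvl in sorted(totals)}

def merge_levels(records, mapping):
    """Recode levels via mapping; levels not in mapping are kept unchanged."""
    return [(mapping.get(level, level), y) for level, y in records]

def firth_log_odds(events, non_events):
    """Firth-type log-odds for one category of a single categorical predictor."""
    return math.log((events + 0.5) / (non_events + 0.5))

table = crosstab(records)
merged = crosstab(merge_levels(records, {"C": "C+D", "D": "C+D"}))

print(table)   # grade D has zero events, so its ML log-odds is not finite
print(merged)  # the merged category keeps all 13 individuals
print(firth_log_odds(*table["D"]))  # finite even with zero events
```

Either route keeps the five individuals in grade D in the analysis, whereas dropping them would discard the information that none of them had an issue.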

Rob Green

Thanks Kelvyn Jones. I could, for instance, group the 2 highest grades. But I'm wondering how I will know whether I have a good model in which I can have some confidence in the p-values. I've started looking at model diagnostics here

https://www.ibm.com/docs/en/spss-statistics/24.0.0?topic=risk-model-diagnostics

and here

https://stats.idre.ucla.edu/stata/webbooks/logistic/chapter3/lesson-3-logistic-regression-diagnostics/

I'm using SPSS, not Stata, so I am having to work out how to do what the latter page describes.

My model fails the LINKTEST, and the apparent solution is to make sure that I include all the relevant variables, and interactions if necessary. At the top of the page it does say that we assume that 'no important variables are omitted'. The problem is that I think my model doesn't include all the important variables. It has very low power to predict: Cox & Snell R Square = 0.02, and the model predicts that none of the individuals has poor performance. But I wasn't really expecting the model to have predictive power, just that it would test whether people in certain groups were more likely to have poor performance, due to bias or something about their characteristics. It's hard to begin to think about factors which would have real predictive power.
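For context on the Cox & Snell figure, here is a minimal sketch in Python of how it is computed from the null and fitted log-likelihoods, using the standard formula R² = 1 - exp(2(LL_null - LL_model)/n). The sample size, event count and log-likelihood gain below are hypothetical, not the poster's. With a rare outcome the maximum attainable Cox & Snell value is well below 1, which is why the Nagelkerke rescaling divides by that maximum.

```python
import math

def null_loglik(events, n):
    """Log-likelihood of the intercept-only model."""
    p = events / n
    return events * math.log(p) + (n - events) * math.log(1.0 - p)

def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell R-squared from the two log-likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def cox_snell_max(ll_null, n):
    """Largest value Cox & Snell R-squared can take (perfect-fit log-likelihood of 0)."""
    return 1.0 - math.exp(2.0 * ll_null / n)

n, events = 1000, 30            # hypothetical: 3% of staff have an issue
ll_null = null_loglik(events, n)
ll_model = ll_null + 10.0       # hypothetical improvement from the predictors

r2 = cox_snell_r2(ll_null, ll_model, n)
r2_max = cox_snell_max(ll_null, n)
nagelkerke = r2 / r2_max

print(f"Cox & Snell R2 = {r2:.4f}")
print(f"maximum possible = {r2_max:.4f}")
print(f"Nagelkerke R2 = {nagelkerke:.4f}")
```

A classification table that predicts nobody has poor performance is also unsurprising with a rare outcome: at the usual 0.5 cut-off, fitted probabilities seldom exceed one half when the base rate is a few percent, even if some coefficients are clearly non-zero.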

Kelvyn Jones

Have a look at

https://statisticalhorizons.com/logistic-regression-for-rare-events

In SPSS, see https://github.com/IBMPredictiveAnalytics/STATS_FIRTHLOG
