Hi everyone,
I'm trying to estimate the probability of a diagnosis X (present vs. absent) as a function of age (continuous), gender (3 categories), and nationality (157 categories) in a large sample of 30,000 individuals, using binomial logistic regression. However, some categories of the nationality variable contain very few individuals.
I considered several approaches to deal with this situation, including grouping certain nationalities together, but this didn't seem appropriate.
I also considered requiring at least 10 individuals with a positive X diagnosis per nationality and dropping from the analysis the nationalities below that threshold. But this would bias the regression, because it would exclude every nationality in which all individuals have a negative diagnosis.
So I'm thinking I'd better select the nationalities represented by at least 10 individuals, no matter how many have a positive or negative diagnosis.
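To make the difference between the two filtering rules concrete, here is a minimal sketch on made-up toy data (the nationality labels A to D and all counts are invented for illustration, not taken from my sample). It only uses the Python standard library and compares "at least 10 positive diagnoses" with "at least 10 individuals":

```python
from collections import Counter

# Made-up toy records: (nationality, diagnosis), with 1 = positive, 0 = negative.
records = (
    [("A", 1)] * 12 + [("A", 0)] * 88   # 100 individuals, 12 positive
    + [("B", 1)] * 5 + [("B", 0)] * 35  # 40 individuals, 5 positive
    + [("C", 0)] * 25                   # 25 individuals, none positive
    + [("D", 1)] * 2 + [("D", 0)] * 4   # 6 individuals, 2 positive
)

MIN_COUNT = 10  # threshold discussed above

# Count all individuals and positive diagnoses per nationality.
n_total = Counter(nat for nat, _ in records)
n_positive = Counter(nat for nat, dx in records if dx == 1)

# Rule 1: keep nationalities with at least MIN_COUNT positive diagnoses.
keep_by_positives = sorted(nat for nat in n_total if n_positive[nat] >= MIN_COUNT)

# Rule 2: keep nationalities with at least MIN_COUNT individuals, whatever the outcome.
keep_by_size = sorted(nat for nat in n_total if n_total[nat] >= MIN_COUNT)

print(keep_by_positives)  # ['A']
print(keep_by_size)       # ['A', 'B', 'C']
```

In this toy case the first rule drops nationality C, where nobody has a positive diagnosis, while the second rule keeps it.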
But then I could end up with, for example, 40 Belgian individuals, 5 of whom have a positive diagnosis and 35 of whom have a negative diagnosis, and I might not have enough statistical power for that category.
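As a rough illustration of why I worry about such a category, here is a small sketch that computes a 95% Wilson score interval for the raw proportion 5 out of 40. The choice of the Wilson interval and the helper name `wilson_interval` are my own assumptions for illustration; this looks at the proportion alone and is not the regression itself:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical example from above: 5 positive diagnoses among 40 individuals.
low, high = wilson_interval(5, 40)
print(f"observed proportion: {5 / 40:.3f}")
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
```

With only 40 individuals, the interval around the observed 12.5% spans roughly 5% to 26%, which is the kind of imprecision I am concerned about.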
What solutions could I apply without losing too many individuals from my sample and without compromising the quality of my regression model?
Should I increase the number of individuals? Keep only nationalities with at least 100 individuals (n = 55 nationalities)? Or is there another solution?
Thank you in advance for your help and advice.
Best