I randomly interviewed 250 poor people and 250 non-poor people. Considering 1 for poor and 0 otherwise, does estimating a logit model aiming to capture the probability of becoming poor make sense? What are possible biases?
A logistic regression may make sense, but there are a few things to consider:
- Is a conditional logistic regression required? (Yes, if cases and controls are matched.)
- You arbitrarily fixed the prevalence of the condition of interest (poor or non-poor) at 50% in your sample. This selection process can distort several estimates, most notably the intercept and therefore the predicted probabilities, especially if the sample prevalence is far from the true prevalence. In that case you should test the robustness of your findings.
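To illustrate the prevalence point, here is a minimal sketch of the adjustment often called prior correction: when the sample is drawn on the outcome (250 poor, 250 non-poor), the fitted intercept is shifted by a known amount that depends on the sample prevalence and the population prevalence. The function names, the coefficient values, and the 20% population poverty rate are all hypothetical, chosen only for illustration. The sketch also assumes the logit model is otherwise correctly specified and that the true prevalence is known from an outside source such as a census.

```python
import math


def prior_corrected_intercept(b0_sample, sample_prevalence, population_prevalence):
    """Adjust a logit intercept estimated on an outcome-stratified sample.

    Subtracts log[((1 - tau) / tau) * (ybar / (1 - ybar))], where tau is the
    population prevalence and ybar is the prevalence in the sample.
    """
    tau = population_prevalence
    ybar = sample_prevalence
    offset = math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
    return b0_sample - offset


def predicted_probability(intercept, slopes, x):
    """Logistic prediction for one observation x (a list of covariate values)."""
    z = intercept + sum(b * xi for b, xi in zip(slopes, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical estimates from the 250/250 sample (not real results).
b0_hat = 0.0      # intercept estimated on the balanced sample
slopes_hat = [0.8]  # slope estimate, left unchanged by the correction

# Assumed true poverty rate of 20% in the population (hypothetical).
b0_adj = prior_corrected_intercept(b0_hat, sample_prevalence=0.5,
                                   population_prevalence=0.2)

for x in ([0.0], [1.0]):
    naive = predicted_probability(b0_hat, slopes_hat, x)
    adjusted = predicted_probability(b0_adj, slopes_hat, x)
    print(x, round(naive, 3), round(adjusted, 3))
```

With these illustrative numbers, an observation with x = 0 gets a predicted probability of 0.5 from the unadjusted model but 0.2 after the correction, which shows how strongly the balanced design can inflate predicted probabilities even when the slope is left untouched.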
Sample selection biases in Logit model estimation can occur when the sample used for estimation is not representative of the population of interest. Some possible sample selection biases include:
Survivorship bias : Only considering individuals who "survived" a certain process or condition, while ignoring those who did not.
Self-selection bias : When individuals select themselves into or out of the sample based on certain characteristics.
Attrition bias : When participants drop out of the study at different rates, potentially leading to an unrepresentative sample.
Sampling bias : When the sampling method itself introduces bias, such as non-response bias or volunteer bias.
Endogeneity bias : When an explanatory variable is correlated with the error term (for example through reverse causality, measurement error, or an omitted confounder), leading to biased estimates.
Omitted variable bias : When a relevant variable that affects the outcome is left out of the model, potentially leading to biased estimates of the remaining coefficients.
These biases can lead to incorrect conclusions and inaccurate predictions. It's essential to consider potential sample selection biases when estimating Logit models to ensure reliable results.