Firth logistic regression is a variant of ordinary logistic regression that handles separation and quasi-separation. To understand Firth logistic regression, we have to take one step back.
What is logistic regression?
Logistic regression is a statistical technique used to model the relationship between a categorical outcome/predicted variable, y (usually binary - yes/no, 1/0), and one or more independent/predictor variables, x.
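To make this concrete, here is a minimal sketch in base R. The data are simulated and the variable names (incentive, meets_pa) are hypothetical, chosen only to match the physical activity example used later.

```r
# Simulated data (hypothetical): 200 adolescents, one binary predictor.
set.seed(1)
incentive <- rbinom(200, 1, 0.5)             # 1 = received an incentive
p_true    <- plogis(-0.5 + 1.2 * incentive)  # assumed true probability of outcome = 1
meets_pa  <- rbinom(200, 1, p_true)          # 1 = meets the recommendation

# Ordinary logistic regression: the log-odds of the outcome are modelled
# as a linear function of the predictor.
fit <- glm(meets_pa ~ incentive, family = binomial)

coef(fit)       # intercept and slope on the log-odds scale
exp(coef(fit))  # baseline odds (intercept) and odds ratio (incentive)
```

The slope tells you how the log-odds of meeting the recommendation change when the predictor goes from 0 to 1.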
What is maximum likelihood estimation?
Maximum likelihood estimation is a statistical technique for finding the model that best represents the relationship between the outcome and the independent/predictor variables in the underlying data (your dataset). For each candidate set of coefficients, the estimation process calculates the probability of observing your dataset under that model (the likelihood) and then selects the coefficients that maximize this probability.
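The idea can be sketched with the simplest possible model: a single probability p for a yes/no outcome. The ten outcomes below are made up for illustration. The code tries many candidate values of p and keeps the one under which the observed data are most probable.

```r
# Hypothetical outcomes: 6 "yes" (1) and 4 "no" (0).
y <- c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0)

# Log-likelihood of the observed data for a candidate probability p.
log_lik <- function(p) sum(dbinom(y, size = 1, prob = p, log = TRUE))

# Evaluate a grid of candidate values and pick the best one.
p_grid <- seq(0.01, 0.99, by = 0.01)
ll     <- sapply(p_grid, log_lik)
p_hat  <- p_grid[which.max(ll)]

p_hat    # maximum likelihood estimate on the grid
mean(y)  # sample proportion, for comparison
```

Logistic regression does the same thing, except that it searches over coefficients instead of a single probability, and it uses an iterative algorithm rather than a grid.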
What is separation?
Separation means an empty bucket on one side! Suppose you are trying to predict meeting physical activity recommendations (outcome - 1/yes and 0/no) and you have three independent or predictor variables: gender (male/female), socio-economic condition (rich/poor), and incentive for physical activity (yes/no). Suppose one combination, gender = male, socio-economic condition = rich, incentive for physical activity = no, always goes with not meeting physical activity recommendations (outcome - 0/no), so nobody with that combination has outcome 1/yes. This is an example of complete separation.
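A quick way to look for an empty bucket is to cross-tabulate the predictor combination against the outcome. The sketch below uses made-up data and hypothetical names (group, meets_pa). A zero in the table is the empty bucket.

```r
# Hypothetical data: 10 adolescents in the problematic combination,
# 30 in all other combinations.
group    <- c(rep("male, rich, no incentive", 10), rep("everyone else", 30))
meets_pa <- c(rep(0, 10), rep(c(0, 1), times = 15))

# Cross-tabulate the group against the outcome.
tab <- table(group, meets_pa)
tab

# TRUE if any bucket is empty.
any(tab == 0)
```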
What is quasi-separation?
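Reconsider the above example. Suppose 50 adolescents have the combination gender = male, socio-economic condition = rich, incentive for physical activity = no. For 48 or 49 of them (nearly, but not exactly, all 50), the outcome is "not meeting physical activity recommendations" (outcome - 0/no). The bucket is not empty, but it is nearly empty. This is an instance of quasi-separation. (Strictly speaking, the estimates stay finite in this situation, but they can become very large and unstable, so the practical symptoms are similar.)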
How can separation or quasi-separation ruin your night's sleep?
When separation or quasi-separation is present in your data, ordinary logistic regression keeps increasing the coefficients of the predictors/independent variables without limit (loosely speaking, toward infinity) to fit the bucket theory, that is, the fact that one of the outcome buckets is completely or nearly empty. When this anomaly happens, it suggests that the ordinary logistic regression model is not suitable here.
The bookish name for this is a convergence issue. But how do you know that a convergence issue has occurred in your model?
- Very large coefficient estimates. The estimates can be near infinite too!
- Along with large coefficient estimates, you may see very large standard errors too!
- Logistic regression may also try several times (known as iterations) but fail to find the best model or, in bookish language, fail to converge.
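These symptoms are easy to reproduce. The sketch below fits ordinary logistic regression with glm() to a tiny made-up dataset in which the predictor x separates the outcome y completely, and collects any warnings R raises along the way. The data and the names x, y, and warnings_seen are illustrative assumptions.

```r
# Hypothetical, completely separated data: y is 0 whenever x is 0,
# and 1 whenever x is 1.
x <- c(rep(0, 10), rep(1, 10))
y <- c(rep(0, 10), rep(1, 10))

# Fit ordinary logistic regression and record the warnings instead of
# letting them scroll past.
warnings_seen <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

warnings_seen              # messages raised during fitting
summary(fit)$coefficients  # inspect the estimates and their standard errors
```

If the coefficient for x is huge and its standard error is even larger, the model is telling you about separation, not about a real effect of that size.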
What should you do if such convergence issues occur?
Forget all the hard work you have done so far! You have to start a new journey with an alternative logistic regression, known as Firth logistic regression. But what does Firth logistic regression actually do? Without getting too technical, it adds a small penalty to the likelihood that keeps the coefficients finite. This leads to more reliable coefficients, which ultimately helps you choose the best representative model for your data.
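For the curious, the idea behind Firth's correction can be sketched in a few lines of base R: instead of maximizing the log-likelihood alone, maximize the log-likelihood plus half the log-determinant of the Fisher information. The code below is a bare-bones illustration on made-up, completely separated data, not a substitute for a dedicated package. The function name neg_pen_loglik and the use of optim() are choices made here for illustration.

```r
# Hypothetical, completely separated data.
x <- c(rep(0, 10), rep(1, 10))
y <- c(rep(0, 10), rep(1, 10))
X <- cbind(1, x)  # design matrix: intercept and predictor

# Negative penalized log-likelihood:
#   -( logLik(beta) + 0.5 * log det(X' W X) ),  with W = diag(p * (1 - p))
neg_pen_loglik <- function(beta) {
  eta    <- drop(X %*% beta)                  # linear predictor
  p      <- plogis(eta)                       # fitted probabilities
  loglik <- sum(y * eta - log1p(exp(eta)))    # ordinary log-likelihood
  info   <- crossprod(X, X * (p * (1 - p)))   # Fisher information X' W X
  penalty <- 0.5 * as.numeric(determinant(info, logarithm = TRUE)$modulus)
  -(loglik + penalty)
}

# Minimize the negative penalized log-likelihood, starting from zero.
fit <- optim(c(0, 0), neg_pen_loglik, method = "BFGS")
fit$par  # intercept and slope: finite, despite complete separation
```

The penalty pulls the fitted probabilities away from exactly 0 and 1, which is why the coefficients settle at finite values where ordinary logistic regression would run off without limit.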
How to conduct Firth logistic regression?
First, install the package "logistf" and load it in your R session.
install.packages("logistf")
library(logistf)
Now, assume you have a dataset "physical_activity" with a binary outcome variable "meeting physical activity recommendation" and three predictor/independent variables: gender (male/female), socio-economic condition (rich/poor), and incentive for physical activity (yes/no).
pa_model
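A minimal sketch of such a fit is shown below. Everything about the data is an assumption: the data frame is simulated, and the column names (meets_pa, gender, ses, incentive) are hypothetical stand-ins for your own. The call to logistf() takes a formula and a data frame, much like glm(). The fit is wrapped in a check so the script still runs if the package is not installed.

```r
# Simulated stand-in for the "physical_activity" dataset.
# Column names are assumptions; replace them with your own.
set.seed(42)
n <- 200
physical_activity <- data.frame(
  gender    = factor(sample(c("female", "male"), n, replace = TRUE)),
  ses       = factor(sample(c("poor", "rich"), n, replace = TRUE)),
  incentive = factor(sample(c("no", "yes"), n, replace = TRUE))
)
physical_activity$meets_pa <- rbinom(n, 1, 0.4)

# Create the empty bucket: male, rich, no incentive never meets
# the recommendation.
empty <- with(physical_activity,
              gender == "male" & ses == "rich" & incentive == "no")
physical_activity$meets_pa[empty] <- 0

# Fit the Firth logistic regression if logistf is available.
if (requireNamespace("logistf", quietly = TRUE)) {
  pa_model <- logistf::logistf(meets_pa ~ gender + ses + incentive,
                               data = physical_activity)
  print(summary(pa_model))
}
```

The summary lists the penalized coefficient estimates with their confidence intervals and p-values, which you can read in the same way as the output of an ordinary logistic regression.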