I am using a multinomial logistic regression in Stata. I want to include 75 districts as a possible explanatory variable, mainly to check for a region effect. Could anyone please help me with how to do this?
I think you can do it without dummy variables. You can create a region variable with districts as its levels by giving a unique number to each district. With this approach you will also be able to plot the predicted probabilities across districts. For plotting the results of multinomial logistic regression please see: https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/
For creating dummy variables, a tutorial is available at: https://www.stata.com/support/faqs/data-management/creating-dummy-variables/
One possible reason for this error may be that there are not enough data points for the dependent variable from each district. Or the combination of multiple explanatory variables, such as gender within each district, does not leave enough data points to estimate the model parameters. If this is the case, you may solve the problem by joining adjacent districts into larger regions. Or keep the top 2, 3 or 10 districts as they are and label the rest of the districts together as "Other districts".
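The pooling idea above could be sketched as follows. This is a minimal Python illustration of the recoding logic only, not Stata code; the function name `collapse_rare_districts`, the threshold `min_count`, and the toy district labels are all hypothetical, and a sensible cutoff depends on your data.

```python
from collections import Counter


def collapse_rare_districts(districts, min_count=30, other_label="Other districts"):
    """Relabel districts with fewer than min_count respondents as other_label."""
    counts = Counter(districts)
    return [d if counts[d] >= min_count else other_label for d in districts]


# Toy data (hypothetical): districts C and D have one respondent each.
sample = ["A"] * 5 + ["B"] * 3 + ["C"] + ["D"]
collapsed = collapse_rare_districts(sample, min_count=3)

# Inspect the respondent counts after pooling the sparse districts.
print(Counter(collapsed))
```

Pooling by geographic adjacency rather than by size alone would need a district-to-region lookup, which is not shown here.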
If the above is not the reason for the error, can you please share more information about the model and the error you are getting?
Can you provide information about your sample size, research question, and why you don't want to treat district as a random variable (i.e., within a multilevel model)?
My sample size is around 11000. I want to test the determinants of self-employment. My dependent variable is categorical. I want to include district, industry and occupation among my independent variables. In particular, I want to check whether the effect varies by region.
I have not considered it yet. Estimating the region effect is not the main goal of my research. Therefore, I was thinking of using dummies instead, but that did not work, as I mentioned in one of my answers above. Since you brought up multilevel modeling, I am a little confused because I am not that familiar with this kind of model. How would I handle a categorical dependent variable in a multilevel model? My dependent variable has 4 categories.
Hi Usman Rashid Thank you for your answer. It does appear that some of the districts have very few respondents. So I tried to follow your suggestion and Stata did produce results. Since my sample is from a household survey, do you think it is good enough to do it this way? Or should I use a multilevel model, as one of the other respondents mentioned below? I am quite confused about which model is most suitable.
Yes, I agree with Daniel Wright that a multilevel model is more appropriate for this situation. A brief example of multinomial multilevel logistic regression is given in: https://www.stata.com/manuals13/semexample41g.pdf
In your case, you can have districts and households nested within districts as random effects. Occupation and industry will be your fixed effects. Here is the Stata manual on multilevel models: https://www.stata.com/manuals13/me.pdf
(Here is one for my ego) One of the earliest uses of multilevel multinomial logistic regression, outside of the statistics literature, is "Comparing system and estimator variables using data from real line-ups". In the appendix we describe the procedure, but this is 23 years old and you are using Stata, so I'd go with their sources. They do cite the study in their multilevel modeling materials, which I was happy to see when I visited there! The version they cite in the materials @Usman references is the statistical details of the main report (Anne Sparks and Anne McDaid are the same person).
Thanks a lot Usman Rashid and Daniel Wright . I briefly went through these articles and it makes more sense to me now. I will read them thoroughly and try to follow and apply it.
Daniel Wright Thanks for sharing useful information and a bit of interesting history. It's always motivating to experience how scientific knowledge advances with interconnected contributions from people all around the world.
Estimation can sometimes be difficult if a district has no (or very few) observations for a specific outcome. It makes a lot of sense to choose the base category (the omitted one) sensibly, e.g. the most common one. With ordinal outcomes I sometimes find that estimation works when the categories are ordered lowest to highest but not the other way around. You may also have to group outcomes if a particular outcome is rare. So some trial and error may be necessary, but I have never failed to get estimates. Stata is notoriously slow with large data sets, and that is when specialist software like MLwiN comes into its own.
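The two checks described above (picking the most common outcome as the base category, and spotting districts where some outcome never occurs) could be sketched like this. It is an illustrative Python sketch rather than Stata code; the function names, district labels and outcome labels are hypothetical.

```python
from collections import Counter, defaultdict


def most_common_outcome(outcomes):
    """Return the most frequent outcome, a candidate for the base category."""
    return Counter(outcomes).most_common(1)[0][0]


def districts_missing_outcomes(districts, outcomes):
    """Map each district to the sorted list of outcomes never observed in it."""
    all_outcomes = set(outcomes)
    seen = defaultdict(set)
    for district, outcome in zip(districts, outcomes):
        seen[district].add(outcome)
    gaps = {}
    for district, observed in seen.items():
        missing = all_outcomes - observed
        if missing:
            gaps[district] = sorted(missing)
    return gaps


# Toy data (hypothetical labels).
districts = ["A", "A", "A", "B", "B", "C"]
outcomes = ["wage", "self", "wage", "wage", "unpaid", "wage"]

base = most_common_outcome(outcomes)
gaps = districts_missing_outcomes(districts, outcomes)
print(base)
print(gaps)
```

Districts that show up in `gaps` are the ones most likely to cause estimation trouble, and are natural candidates for pooling or for grouping rare outcomes.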
Hi Usman Rashid and Daniel Wright , I tried using the multilevel multinomial logistic regression. I am not sure whether I am doing it correctly. First, I tried a simplified, non-multilevel version of the command:
Sabina Thapa Magar I am not well versed in Stata. You may also look at the MLwiN material shared by Kelvyn Jones , chapter 10 in the manual along with its do/log files (at: http://www.bristol.ac.uk/cmm/software/runmlwin/examples/).
Assuming that the model structure is correctly specified, a small variance for the district random effect suggests that the differences between districts are small. I see i.dist as a fixed effect and dist as a random effect. What is i.dist?
I think households nested within districts would be specified as: M1[dist] M2[dist>hhid]. Here is an example: https://www.stata.com/manuals/semexample39g.pdf
Thanks Daniel Wright for your contribution. Your article did help me a lot. But I am stuck on the syntax now, as I am also using this model for the first time.
Hi Usman Rashid , The last command should not include i.dist as a fixed effect. I posted the wrong command, sorry. But other than that, my question was exactly about using hhid and district as random effects. I will check the second link that you provided as well as the one Kelvyn Jones suggested. Thank you very much.
i.dist is for district. So that means none of the commands above should include the district variable as a fixed effect. I should exclude it from all the commands and use it as a nesting level instead.
Sabina Thapa Magar Yes, in this case it does not make sense to include district as both a random and a fixed effect. The following model sounds plausible:
gsem (employment_status <- sex schooling total_hours occupation industry M1[dist] M2[dist>hhid]), mlogit
This model estimates the log odds of each employment status category, relative to a base category, as a function of (sex schooling total_hours occupation industry), while attributing the remaining variance in the data to households nested within districts.
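For reference, one common way to write a model of this kind is given below. This is a sketch under assumptions that are not stated in the thread: category 1 is taken as the base, the district and household random intercepts are shared across the three outcome equations with unit loadings (software may instead estimate or constrain these loadings differently), and the notation (i for individual, j for household, k for district) is my own.

```latex
\[
\log\frac{\Pr(y_{ijk}=m)}{\Pr(y_{ijk}=1)}
  = \mathbf{x}_{ijk}^{\top}\boldsymbol{\beta}_m + u_k + v_{jk},
  \qquad m = 2, 3, 4,
\]
\[
u_k \sim N(0,\sigma_u^2), \qquad v_{jk} \sim N(0,\sigma_v^2).
\]
```

Here the vector x holds the fixed-effect covariates (sex, schooling, total_hours, occupation, industry), u is the district-level random intercept and v is the household-within-district random intercept. A district variance close to zero would then correspond to small differences between districts.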
Sabina Thapa Magar Yes, but not necessarily. If the research question is to evaluate differences across industry and occupation, I would focus on these.