In my dataset, some of the categorical variables have 18 or more levels. Can I merge levels based on their frequency, or should I keep them in the model as they are? And on what basis should I combine them?
Whatever logic you use for combining categories will have to be easily understood by your reviewers/readers. The most frequently used strategy is to combine the less common categories into a single "Other" category.
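For anyone who wants to see the "Other" strategy spelled out, here is a minimal sketch in plain Python. The function name `collapse_rare_levels`, the cut-off `min_count=5`, and the job data are all assumptions made up for illustration; they are not from the original question, and the right threshold depends on your data and model.

```python
from collections import Counter

def collapse_rare_levels(values, min_count=5, other_label="Other"):
    """Replace every level observed fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical job column, invented for illustration only.
jobs = ["teacher"] * 6 + ["nurse"] * 5 + ["pilot"] * 2 + ["chef"] * 1

collapsed = collapse_rare_levels(jobs, min_count=5)

# Frequency table after collapsing: "pilot" and "chef" now share one level.
print(Counter(collapsed))
```

Keeping the threshold as an explicit parameter makes the rule easy to report to reviewers ("levels with fewer than 5 cases were merged into Other").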
Collapsing the levels/categories of a categorical variable can be useful when there is a theoretical reason for it (e.g., reducing respondents’ education level to just two categories: university graduate and non-university graduate). It can also be driven by a decision made after evaluating the data (e.g., having few observations in some categories). Either way, you could do the merge in SPSS following the procedure for recoding a categorical variable as illustrated by van den Berg (2021) and KSU Libraries (2021). You might refer to Rutkowski et al. (2019) and DiStefano et al. (2021) for input on justifying the collapse of the categories. Here are the full citations.
DiStefano, C., Shi, D., & Morgan, G. B. (2021). Collapsing categories is often more advantageous than modeling sparse data: Investigations in the CFA framework. Structural Equation Modeling: A Multidisciplinary Journal, 28(2), 237–249. https://doi.org/10.1080/10705511.2020.1803073
KSU Libraries. (2021, October 4). LibGuides: SPSS tutorials: Recoding variables. LibGuides at Kent State University. https://libguides.library.kent.edu/spss/recodevariables
Rutkowski, L., Svetina, D., & Liaw, Y.-L. (2019). Collapsing categorical variables and measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 26(5), 790–802. https://doi.org/10.1080/10705511.2018.1547640
van den Berg, R. (2021, August). SPSS - Merge categories of categorical variable. SPSS tutorials | The Ultimate Guide to SPSS. https://www.spss-tutorials.com/spss-merge-categories-of-categorical-variable/
You can also regroup the levels (modalities) of the categorical variable by frequency as a preprocessing step before applying a machine learning model. Be aware, though, that combining levels may lead to a loss of information.
Yes, you can. The R package collapse by Krantz et al. (2016) could be helpful. Here is the full citation.
Krantz, S., Dowle, M., Srinivasan, A., Berge, L., Eddelbuettel, D., Pasek, J., & Tappe, K. (2016). Collapse 1.6.5. Advanced and fast data transformation in R. https://sebkrantz.github.io/collapse/
Samaneh H.Bahreini, although it is permissible to combine categorical data, whether you should depends very much on how you want to use the data, i.e., whether it will be used only for descriptive statistics or you plan to apply inferential statistics as well.
By the way, what do you mean by combining categories based on their frequencies? Furthermore, it would be possible only when you have meaningful categories, such as education level in Mohialdeen Alotumi's example. Thanks.
Muhammad Zia Aslam Dear Muhammad, thank you for the answer.
In my dataset, for example, income has 20 levels and education has 9 levels, and I have 14 independent variables. After the descriptive analysis, I need to run ordered and unordered logistic models and a count model and compare them to find the best fit for my data.
Some of the income levels, for example, have very low frequencies, which is why I want to combine them.
As I wrote to Muhammad Zia Aslam, after preparing the data I need to run ordered and unordered logistic regressions. I read some papers and chapters and found that I also need to do cross-validation. Should I do the cross-validation before running the model? And after modelling, should I use other methods as well to assess goodness of fit?
Dear Samaneh H.Bahreini, it is quite natural to have some groups with very few respondents when you have many categories. As I understand it, you have measured variables such as income and education as groups or categories coded 1, 2, 3, 4, 5, and so on. You can easily expand the membership of each group by simply recoding to combine categories. In my understanding, this is neither manipulation nor an issue to worry about. By "expand" I mean raising the bar, for example, for the first income group from 1-1000 to 1-50,000, or whatever fits your specification of the higher income groups. For logistic regression, you can take guidance from Prof. Mike Crowson's YouTube tutorials. Thanks.
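To make the recoding idea concrete for ordered groups such as income bands, here is a small sketch in plain Python that only ever merges neighbouring levels, so the ordering is preserved. The function `merge_adjacent_levels`, the threshold of 5, and the income codes are hypothetical and chosen only for illustration; in practice the cut points should still make substantive sense.

```python
from collections import Counter

def merge_adjacent_levels(codes, ordered_levels, min_count):
    """Map each ordered level to a wider band so that every band holds at
    least min_count observations. Only neighbouring levels are merged."""
    counts = Counter(codes)
    mapping, band, running = {}, 1, 0
    for level in ordered_levels:
        mapping[level] = band
        running += counts[level]
        if running >= min_count:
            # Current band is large enough: start a new one.
            band += 1
            running = 0
    if running > 0 and band > 1:
        # The top levels did not reach min_count: fold them into the band below.
        for level, assigned in mapping.items():
            if assigned == band:
                mapping[level] = band - 1
    return mapping

# Hypothetical income codes 1..6 with sparse levels at both ends.
income = [1] * 2 + [2] * 3 + [3] * 10 + [4, 5, 6]

mapping = merge_adjacent_levels(income, ordered_levels=[1, 2, 3, 4, 5, 6], min_count=5)
recoded = [mapping[code] for code in income]

print(mapping)
print(Counter(recoded))
```

The returned `mapping` doubles as documentation of which original groups were combined, which is the same information you would put in an SPSS recode step.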
Muhammad Zia Aslam, thank you very much. Yes, all of my variables are categorical, coded 1, 2, 3, ..., 18 as you said. I just wanted to ask whether there are specific rules for combining, for example, 3 levels. I did not find anything online; I suppose I should combine the less frequent groups together.
Sorry David, could you please explain what you mean? My income and age variables are categorical too; for example, I have 18 levels of income and 5 levels of age.
If you have 18 ordinal categories (that is, each category is higher or lower than another) as an exposure variable, I would treat them as quasi-continuous.
If the categories of the exposure variable are nominal (that is, different but not ordered), I would group them, if needed, on the basis of theory, while also being aware of the frequencies; theory trumps frequency.
@Kelvyn Jones I agree with you about the ordinal DV. Here I would use truncated regression and not OLS. Full details can be found in the attached screenshot. I certainly agree with all of your other suggestions. Best wishes to all, David Booth
Samaneh H.Bahreini, yes, as I understand it only one rule applies in this case, and that is "meaningfulness". Your issue basically relates to the pre-data-collection stage and can be resolved easily by making the group representation meaningful. In doing so, you will not disturb any individual responses; you will just make the groups meaningful. So I think you should move forward with your analysis by regrouping the categories. Good luck. Thanks.
Muhammad Zia Aslam, thank you for your consideration. I want to ask: is it possible to combine the less frequent job levels together? I cannot find any reference on it. I have more than 20 job levels; can I combine the less frequent ones, or should they enter the model as they are? I cannot find a meaningful justification for combining, for example, housekeepers, jobless people, and students just because of their frequency in the data. Do you have any ideas?
Muhammad Zia Aslam, the idea of "collapsing" less frequent categories into an "Other" category is basically common sense, so you do not need a reference for it.
I totally agree with you, Prof. David L Morgan. I think Samaneh H.Bahreini wanted to be on the safe side by asking for a reference on the "common sense" of "collapsing" categories :). It is surely not needed. Regards.
Muhammad Zia Aslam, thank you again for your answer; I really appreciate your help. My concern was mainly about the job categories, specifically their interpretation: if I combine the less frequent ones, then I have housekeepers, students, jobless people, managers, and so on in one level.
If you have a range of different jobs grouped together, then check whether their mean on your dependent variable is close to the overall mean. If so, that would make this Other category a good candidate for the "omitted" (reference) category in a dummy variable analysis.
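As a rough illustration of that check, the sketch below compares the mean outcome of the "Other" group with the overall mean and then builds dummy variables with "Other" as the omitted category. The records, the outcome values, and the helper `dummy_code` are all invented for this example; with real data you would also want to look at the spread, not only the means.

```python
from statistics import mean

# Hypothetical (job, outcome) pairs, invented for illustration only.
records = [
    ("teacher", 3.0), ("teacher", 5.0),
    ("nurse", 6.0), ("nurse", 8.0),
    ("Other", 5.0), ("Other", 6.0),
]

overall_mean = mean(y for _, y in records)
other_mean = mean(y for job, y in records if job == "Other")

# A gap near zero suggests "Other" behaves like the average respondent.
gap = other_mean - overall_mean
print(f"overall={overall_mean:.2f} other={other_mean:.2f} gap={gap:.2f}")

def dummy_code(job, levels, reference="Other"):
    """Return 0/1 indicators for every level except the omitted reference."""
    return {f"job_{lvl}": int(job == lvl) for lvl in levels if lvl != reference}

levels = sorted({job for job, _ in records})
design_rows = [dummy_code(job, levels) for job, _ in records]
print(design_rows[0])   # a "teacher" row
print(design_rows[-1])  # an "Other" row: all indicators are 0
```

With "Other" omitted, each remaining coefficient is read as a difference from the mixed group, which is easiest to interpret when that group sits near the overall average.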