In this question, we assume we have a health dataset with many triplets of dummy variables. The dataset looks like this:
(existence_of_symptomA (1/0), symptomA_chronic (1/0), symptomA_persistent (1/0), existence_of_symptomB (1/0), symptomB_chronic (1/0), symptomB_persistent (1/0).......)
Each row represents a patient. The variables are coded as dummies because several symptoms may coexist in the same patient.
The outcome of interest is a dummy variable "hospital death" (1/0).
If you take a look at the data structure, you will notice that semantically the "existence_of_symptom" variables are the main ones, while the "symptom_chronic" and "symptom_persistent" describe characteristics of the "main" dummy variable.
If one wants to study the odds of death based solely on the existence of symptoms (just the existence_of_symptom variables), this is a multiple binary logistic regression problem. The fitted model would give an odds ratio for death for each symptom, adjusted for the other symptoms.
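To make the existence-only model concrete, it can be written as below. The symbols are my own notation, not from any particular package: $E_A, E_B, \dots$ stand for existence_of_symptomA, existence_of_symptomB, and so on, and the $\beta$'s are the coefficients to be estimated.

```latex
\operatorname{logit} P(\text{death} = 1)
  = \log \frac{P(\text{death} = 1)}{1 - P(\text{death} = 1)}
  = \beta_0 + \beta_A E_A + \beta_B E_B + \dots
```

Under this model, $e^{\beta_A}$ is the odds ratio for death comparing patients with symptom A to patients without it, holding the other symptoms fixed.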
Here is the question: What would be the best approach to study the predictive contribution of the two extra "symptom_chronic" and "symptom_persistent" dummy variables per symptom? Would you simply add all of them to the list of independent variables (IVs) and run the logistic regression?
Wouldn't this approach be incorrect?
To begin with, every patient without a symptom will always have a value of 0 for both the chronic and the persistent variables. Also, how will the model recognize and account for the fact that the data should be read as triplets?
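To illustrate the constraint I mean, here is a minimal sketch in plain Python that simulates one triplet per patient. The function name `simulate_patient` and the probabilities are made up for illustration only; they do not come from the real dataset.

```python
import random

random.seed(0)  # fixed seed so the simulated data are reproducible


def simulate_patient(p_exist=0.3, p_chronic=0.5, p_persistent=0.4):
    """Return one hypothetical (existence, chronic, persistent) triplet.

    The probabilities are arbitrary assumptions, chosen only for illustration.
    """
    exists = 1 if random.random() < p_exist else 0
    # The qualifiers can only be 1 when the symptom exists;
    # otherwise they are structural zeros.
    chronic = 1 if exists and random.random() < p_chronic else 0
    persistent = 1 if exists and random.random() < p_persistent else 0
    return exists, chronic, persistent


patients = [simulate_patient() for _ in range(1000)]

# Check 1: no patient has a qualifier set without the symptom itself.
violations = [t for t in patients if t[0] == 0 and (t[1] == 1 or t[2] == 1)]

# Check 2: each qualifier is identical to its product with the existence dummy.
nested = all(c == e * c and p == e * p for e, c, p in patients)

# Distinct patterns that actually occur (at most 5 of the 8 possible triplets).
patterns = sorted(set(patients))

print("violations:", len(violations))
print("qualifier == existence * qualifier:", nested)
print("observed patterns:", patterns)
```

Because each qualifier equals its product with the existence dummy, the column symptomA_chronic is numerically the same as the product existence_of_symptomA × symptomA_chronic. Whether that is enough for the model to "see" the triplet structure is exactly what I am unsure about.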
Any insights?