In this question, we assume we have a health dataset with many triplets of dummy variables. The dataset looks like this:

(existence_of_symptomA (1/0), symptomA_chronic (1/0), symptomA_persistent (1/0), existence_of_symptomB (1/0), symptomB_chronic (1/0), symptomB_persistent (1/0).......)

Each row represents a patient, and the symptoms are coded as dummy variables because multiple symptoms may coexist in the same patient.

The outcome of interest is a dummy variable "hospital death" (1/0).

If you take a look at the data structure, you will notice that semantically the "existence_of_symptom" variables are the main ones, while the "symptom_chronic" and "symptom_persistent" describe characteristics of the "main" dummy variable.

If one wants to study the odds of death based solely on the existence of symptoms (just the existence_of_symptom variables), this is a multiple binary logistic regression problem. The fitted model gives, for each symptom, an odds ratio for death adjusted for the other symptoms.
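To make the existence-only model concrete, here is a minimal sketch of its form for two symptoms. The intercept and coefficients are hypothetical numbers chosen only for illustration (they are not estimated from any data), and the function names are my own.

```python
import math

def death_probability(intercept, coefs, symptoms):
    """Logistic model: P(death) = 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    linear = intercept + sum(b * x for b, x in zip(coefs, symptoms))
    return 1.0 / (1.0 + math.exp(-linear))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# Hypothetical values, for illustration only (not fitted to any dataset).
INTERCEPT = -2.0
COEFS = [0.7, 0.3]  # log odds ratios for existence_of_symptomA and existence_of_symptomB

# Same patient profile for symptom B, with and without symptom A.
p_without_a = death_probability(INTERCEPT, COEFS, [0, 1])
p_with_a = death_probability(INTERCEPT, COEFS, [1, 1])

# The odds ratio for symptom A is exp(coefficient of A),
# whatever the value of the other symptom.
odds_ratio_a = odds(p_with_a) / odds(p_without_a)
print(odds_ratio_a, math.exp(COEFS[0]))
```

In other words, each coefficient is a log odds ratio for one symptom with the other symptoms held fixed, which is what "the odds for death, for each symptom" amounts to in this model.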

Here is the question: What would be the best approach to study the predictive contribution of the two extra "symptom_chronic" and "symptom_persistent" dummy variables per symptom? Would you simply add all of them to the list of independent variables (IVs) and run the logistic regression?

Wouldn't this approach be incorrect?

To begin with, everyone without a symptom will always have a value of 0 for both the chronic and the persistent variables as well! Also, how will the model recognize and "account for" the fact that the data should be seen as triplets?
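The constraint described above can be written out explicitly. The sketch below is only an illustration under two assumptions of mine: that chronic and persistent are not mutually exclusive, and that all three dummies of a triplet enter the model as plain additive terms. The coefficients are hypothetical. It shows that only 5 of the 8 possible triplets can occur, that each "characteristic" dummy is identical to its product with the existence dummy, and what the coefficients then compare.

```python
import itertools

def is_valid(exists, chronic, persistent):
    """A characteristic can be 1 only if the symptom itself is present."""
    return exists == 1 or (chronic == 0 and persistent == 0)

# Enumerate all 2**3 codings and keep the ones the data can actually contain.
valid_triplets = [t for t in itertools.product([0, 1], repeat=3) if is_valid(*t)]
print(valid_triplets)  # 5 of the 8 combinations remain

# Because chronic is 0 whenever exists is 0, chronic == exists * chronic
# (and likewise for persistent): the dummy already acts as an interaction term.
nested_as_interaction = all(
    c == e * c and p == e * p for e, c, p in valid_triplets
)

def log_odds(b0, b_exists, b_chronic, b_persistent, triplet):
    """Linear predictor for one symptom triplet with additive terms."""
    e, c, p = triplet
    return b0 + b_exists * e + b_chronic * c + b_persistent * p

# Hypothetical coefficients, for illustration only.
B = (-2.0, 0.5, 0.4, 0.2)

# exists coefficient: symptom present but neither chronic nor persistent, vs. no symptom.
diff_exists = log_odds(*B, (1, 0, 0)) - log_odds(*B, (0, 0, 0))
# chronic coefficient: chronic vs. non-chronic, among patients who have the symptom.
diff_chronic = log_odds(*B, (1, 1, 0)) - log_odds(*B, (1, 0, 0))
print(diff_exists, diff_chronic)
```

Under these assumptions, adding everything to the list of IVs is not wrong in itself, but the interpretation changes: the existence coefficient refers to the "plain" form of the symptom, and the chronic and persistent coefficients are contrasts within patients who have it.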

Any insights?
