Hello,

I'm working on a clustering analysis and would be curious whether anyone has ideas about how to deal with nested categorical variables.

Normally I would calculate a distance/dissimilarity matrix (Gower when some variables are categorical), and then feed this to a clustering algorithm of choice. Now what happens when some categorical variables are nested?
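For reference, here is a minimal sketch of the usual (non-nested) workflow in plain Python. It assumes numeric variables are scaled by their range and categorical variables use simple matching (0 if equal, 1 otherwise), with equal weights. The function name `gower_matrix` and the toy data are made up for illustration, and presence/absence is treated as a symmetric binary variable here, which may not be what you want for rare compounds.

```python
def gower_matrix(rows, is_categorical):
    """Pairwise Gower dissimilarity for a list of equal-length records."""
    n_vars = len(is_categorical)
    # Range of each numeric variable, used to scale differences to [0, 1].
    ranges = []
    for k in range(n_vars):
        if is_categorical[k]:
            ranges.append(None)
        else:
            col = [r[k] for r in rows]
            ranges.append(max(col) - min(col))
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(n_vars):
                a, b = rows[i][k], rows[j][k]
                if is_categorical[k]:
                    # Simple matching for categorical/binary variables.
                    total += 0.0 if a == b else 1.0
                elif ranges[k] > 0:
                    # Range-normalised absolute difference for numeric ones.
                    total += abs(a - b) / ranges[k]
            # Unweighted average over all variables.
            d[i][j] = d[j][i] = total / n_vars
    return d


# Toy samples: turbidity, temperature, compound 1 present?, compound 2 present?
samples = [
    (2.0, 10.0, "yes", "no"),
    (4.0, 20.0, "yes", "yes"),
    (6.0, 30.0, "no", "yes"),
]
D = gower_matrix(samples, [False, False, True, True])
```

The resulting matrix `D` is what would then be passed to the clustering algorithm of choice (hierarchical clustering, PAM, and so on).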

Fictitious example

Suppose we measure characteristics of water samples such as turbidity, temperature, dissolved gases, and the presence/absence of 50 chemical compounds in the water.

* presence/absence of chemical compounds can be treated as 50 separate binary/categorical variables

* but say that these chemicals belong to 4 groups of compounds?

Thoughts

We could simply add an additional categorical variable "group", and for more complex nesting "subgroup", "subsubgroup", and so on. But as far as I understand, Gower distance is a bit like Manhattan distance in that it calculates a dissimilarity for each variable and then combines them as a (weighted) average. Part of the information would therefore be redundant, and even more so with more levels of nesting. I was wondering whether anyone has come up with something else to deal specifically with that. Maybe some form of weighting of the variables?
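To make the weighting idea concrete, here is a rough sketch of one possible scheme, not an established method as far as I know: give each compound a weight of 1 / (size of its group), so that every group of compounds contributes as much to the Gower average as a single ungrouped variable. The names `group_weights` and `weighted_gower`, the group labels, and the toy data are all hypothetical.

```python
from collections import Counter


def group_weights(groups):
    """Weight 1/size for variables in a group; None means ungrouped (weight 1)."""
    sizes = Counter(g for g in groups if g is not None)
    return [1.0 if g is None else 1.0 / sizes[g] for g in groups]


def weighted_gower(a, b, is_categorical, ranges, weights):
    """Weighted Gower dissimilarity between two records a and b."""
    num = 0.0
    for k, w in enumerate(weights):
        if is_categorical[k]:
            d = 0.0 if a[k] == b[k] else 1.0
        else:
            d = abs(a[k] - b[k]) / ranges[k] if ranges[k] > 0 else 0.0
        num += w * d
    # Weighted average of the per-variable dissimilarities.
    return num / sum(weights)


# Toy setup: turbidity, then 3 compounds; compounds 1-2 in group "A", 3 in "B".
groups = [None, "A", "A", "B"]
is_categorical = [False, True, True, True]
ranges = [4.0, None, None, None]  # assumed observed range of turbidity
weights = group_weights(groups)   # [1.0, 0.5, 0.5, 1.0]

x = (2.0, 1, 1, 0)
y = (6.0, 0, 0, 0)
d_weighted = weighted_gower(x, y, is_categorical, ranges, weights)
d_plain = weighted_gower(x, y, is_categorical, ranges, [1.0] * 4)
```

In this toy case the two samples differ on both group "A" compounds, and the group weighting counts that as one disagreement rather than two, so `d_weighted` comes out lower than `d_plain`. Deeper nesting could in principle be handled by multiplying weights down the hierarchy, but I have not checked how well that behaves in practice.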

Looking forward to your inputs!

Mick
