Hello,
I'm working on a clustering analysis and am curious whether anyone has ideas about how to deal with nested categorical variables.
Normally I would calculate a distance/dissimilarity matrix (Gower when some variables are categorical), and then feed this to a clustering algorithm of choice. Now what happens when some categorical variables are nested?
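To make the usual workflow concrete, here is a minimal pure-Python sketch of a Gower-style dissimilarity matrix. The function name `gower_matrix`, the `kinds` labels, and the sample values are all made up for illustration. It assumes no missing values and treats binary variables with simple matching (joint absences count as agreement), which is only one of several possible conventions.

```python
def gower_matrix(rows, kinds, weights=None):
    """Pairwise Gower-style dissimilarity for mixed data.

    rows    : list of records (lists of values)
    kinds   : "num" or "cat" for each column
    weights : optional per-column weights (defaults to 1 for every column)
    """
    n_vars = len(kinds)
    weights = weights or [1.0] * n_vars

    # Range of each numeric column, used to scale differences into [0, 1].
    ranges = []
    for j, kind in enumerate(kinds):
        if kind == "num":
            col = [r[j] for r in rows]
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)

    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j, kind in enumerate(kinds):
                if kind == "num":
                    # Range-scaled absolute difference (0 if the column is constant).
                    part = abs(rows[a][j] - rows[b][j]) / ranges[j] if ranges[j] else 0.0
                else:
                    # Simple matching: 0 if equal, 1 otherwise.
                    part = 0.0 if rows[a][j] == rows[b][j] else 1.0
                total += weights[j] * part
            # Weighted average of the per-variable dissimilarities.
            d[a][b] = d[b][a] = total / sum(weights)
    return d


# Hypothetical samples: turbidity, temperature, compound_A, compound_B
samples = [
    [1.0, 10.0, 1, 0],
    [3.0, 20.0, 1, 1],
    [5.0, 30.0, 0, 1],
]
kinds = ["num", "num", "cat", "cat"]
D = gower_matrix(samples, kinds)
```

The resulting matrix `D` can then be passed to any clustering routine that accepts a precomputed dissimilarity matrix.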
Fictitious example
Suppose we measure characteristics of water samples such as turbidity, temperature, dissolved gases, and the presence/absence of 50 chemical compounds in the water.
* presence/absence of chemical compounds can be treated as 50 separate binary/categorical variables
* but say that these chemicals belong to 4 groups of compounds?
Thoughts
We could simply add an additional categorical variable "group" and, for more complex nesting, "subgroup", "subsubgroup"... OK, but as far as I understand, Gower distance is a bit like Manhattan distance in that it calculates a dissimilarity for each variable separately and then combines them, optionally with per-variable weights. But then part of the information will be redundant, and even more so if there are more levels of nesting. I was wondering whether anyone has come up with something else to specifically deal with that. Maybe some form of weighting of the variables?
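To illustrate the kind of weighting I have in mind, here is a rough sketch and not an established method as far as I know. It assumes a total weight of 1 is split equally among siblings at every level of the nesting, so each group contributes the same amount no matter how many compounds it contains. The names `nested_weights`, `weighted_mismatch`, and the compound/group labels are hypothetical.

```python
def nested_weights(paths):
    """Split a total weight of 1 equally among siblings at each nesting level.

    paths : dict mapping variable name -> tuple of the levels it is nested in,
            e.g. ("group_1",) or ("group_1", "subgroup_a")
    """
    # Full path of each leaf in the hierarchy: (group, subgroup, ..., variable).
    full = {var: tuple(path) + (var,) for var, path in paths.items()}

    # For every prefix (a node in the hierarchy), collect its distinct children.
    children = {}
    for node in full.values():
        for k in range(len(node)):
            children.setdefault(node[:k], set()).add(node[k])

    # A leaf's weight is the product of 1 / (number of siblings) along its path.
    weights = {}
    for var, node in full.items():
        w = 1.0
        for k in range(len(node)):
            w /= len(children[node[:k]])
        weights[var] = w
    return weights


def weighted_mismatch(x, y, weights):
    """Weighted simple-matching dissimilarity between two presence/absence records."""
    total = sum(weights.values())
    return sum(w for var, w in weights.items() if x[var] != y[var]) / total


# Hypothetical nesting: three compounds in group g1, one compound in group g2.
paths = {"c1": ("g1",), "c2": ("g1",), "c3": ("g1",), "c4": ("g2",)}
w = nested_weights(paths)

a = {"c1": 1, "c2": 1, "c3": 1, "c4": 0}
b = {"c1": 0, "c2": 0, "c3": 0, "c4": 0}
d_ab = weighted_mismatch(a, b, w)
```

Here the two samples disagree on every compound of g1 and agree on g2, so the weighted dissimilarity is one half, whereas equal weights on the four compounds would give three quarters. Weights like these could be supplied as the per-variable weights of a Gower dissimilarity instead of adding "group" as an extra variable.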
Looking forward to your inputs!
Mick