Dear experts in statistics, I have a question for you. What is the point of performing, e.g., hierarchical clustering of a correlation matrix? I do not mean clustering that is based on the correlations among the original variables, but exactly what I wrote: clustering performed on the square correlation matrix itself. So, suppose you have 30 variables, each with 100 measures: in the first case you feed the algorithm a 30x100 matrix, in the second a 30x30 matrix. If we perform clustering based on correlation (on the 30x100 matrix), we get clusters of variables that behave similarly; working directly on the correlation matrix instead gives clusters of variables whose patterns of correlation with the other variables are similar. Are the two interchangeable? I would say no, and until now I have avoided clustering the correlation matrix, mainly because it seemed strange to me to first calculate a correlation matrix and then calculate a distance on it to perform the clustering (so "distances" are calculated twice*).

However, I am receiving papers to review where the authors did exactly this, and I cannot understand why they did so and, more importantly, whether there are any advantages (except, perhaps, a nicer figure). Moreover, in principle, if some normalization is performed on the matrix before clustering (for instance, standardization of the rows), entries that used to be equal, namely the correlations (i, j) and (j, i) (i indexing rows and j columns), might become different. In addition, with the correlation matrix you need to cluster both the columns and the rows simultaneously, otherwise the symmetry is broken.
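To make the distinction concrete, here is a minimal MATLAB sketch of the two procedures I have in mind (assuming the Statistics Toolbox; X, D1, Z1, etc. are just placeholder names, and the toy data are random):

X = randn(100, 30);              % placeholder data: 100 measures of 30 variables
% (a) clustering based on correlations among the original variables:
% the distance between variables i and j is 1 - corr(x_i, x_j)
D1 = pdist(X', 'correlation');   % pdist works on rows, so transpose: 30 rows = 30 variables
Z1 = linkage(D1, 'average');
dendrogram(Z1);
% (b) clustering the 30x30 correlation matrix as if it were data:
% each variable is represented by its row of correlations with all 30 variables,
% and a new distance is computed between those rows
C  = corr(X);                    % 30x30 correlation matrix
D2 = pdist(C, 'euclidean');      % e.g. Euclidean distance between correlation profiles
Z2 = linkage(D2, 'average');
dendrogram(Z2);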

Moreover, two variables that have a low correlation with each other might still have similar profiles of correlation with the other variables. From the few examples I have tried, the two situations look very different to me. See the attached figure: it is simply the heatmap of log2(C./CC), where C is the correlation matrix of random normally distributed numbers and CC is the correlation matrix calculated on C. In other words, it is the log2 ratio between the correlations used to cluster the original data and the correlations computed on the correlation matrix, which is what happens when one uses the correlation matrix for clustering without specifying that it is a correlation matrix.
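The comparison behind the figure can be reproduced roughly along these lines (MATLAB sketch; I take absolute values of the ratio here only to keep log2 real when the two correlations differ in sign, which is a small deviation from the figure as described):

X  = randn(100, 30);            % random normally distributed numbers
C  = corr(X);                   % correlations among the original variables
CC = corr(C);                   % correlations computed on the correlation matrix itself
R  = log2(abs(C) ./ abs(CC));   % log2 ratio of the two sets of "correlations" (abs keeps log2 real)
imagesc(R); colorbar;           % heatmap: off-diagonal values far from 0 show how much the two disagree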

To conclude: is it correct to cluster a correlation matrix by calculating a new distance matrix on it? Can someone formalize this from a statistical/mathematical point of view, in a way that leaves no room for contradiction?

*This can be specified, for instance, to the clustergram function; in papers, however, the correlation matrix given as input is often treated as a "normal" matrix of measurements, and a new distance is calculated on it.
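For comparison, using the correlations themselves as the (dis)similarities, i.e. without computing a second distance on top of them, would look roughly like this in MATLAB (1 - C is one common conversion, not the only possible one):

C = corr(randn(100, 30));                % toy correlation matrix
D = 1 - C;                               % turn correlations into dissimilarities
D = (D + D') / 2;                        % enforce exact symmetry for squareform
D(logical(eye(size(D)))) = 0;            % force a zero diagonal
Z = linkage(squareform(D), 'average');   % squareform converts the square matrix to the vector form linkage expects
dendrogram(Z);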
