I want to relate how similar people are in their metabolite levels to how similar they are in personal attributes.
Let's say I have measured levels of 500 metabolites in 1000 subjects. For each metabolite I generate z-scores by subtracting the metabolite mean from the levels and dividing by the metabolite standard deviation. I then generate the 1000 by 1000 subject-by-subject correlation matrix of the metabolite z-scores (each element of the matrix is a sample correlation with sample size 500). Plotting the matrix as a heatmap, with subjects ordered by personal attributes (e.g. age, height, BMI, sex), it is clear that similarity in some personal attributes predicts correlation in metabolite levels.
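To make the construction concrete, here is a minimal sketch of the z-scoring and subject-by-subject correlation step. It is written in Python purely for illustration, runs on simulated data, and assumes each metabolite is also divided by its standard deviation, as a z-score normally is. The function names and the small dimensions (6 subjects, 50 metabolites) are hypothetical stand-ins for the real 1000 by 500 data.

```python
import random
from statistics import mean, pstdev

def zscore_columns(levels):
    """Standardise each metabolite (column) across subjects (rows)."""
    z_cols = []
    for col in zip(*levels):  # iterate over metabolites
        m, s = mean(col), pstdev(col)
        z_cols.append([(x - m) / s for x in col])
    return [list(row) for row in zip(*z_cols)]  # back to subjects x metabolites

def pearson(x, y):
    """Sample Pearson correlation between two equal-length vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def subject_correlation_matrix(z):
    """Correlate every pair of subjects across their metabolite z-scores."""
    n = len(z)
    corr = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            corr[i][j] = corr[j][i] = pearson(z[i], z[j])
    return corr

# Simulated stand-in data (assumption: the real data would be 1000 x 500).
random.seed(0)
n_subjects, n_metabolites = 6, 50
levels = [[random.gauss(0, 1) for _ in range(n_metabolites)]
          for _ in range(n_subjects)]

z = zscore_columns(levels)
corr = subject_correlation_matrix(z)
```

Each off-diagonal entry `corr[i][j]` is the quantity called metabolite correlation for the pair of subjects (i, j).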
I'd like to model what predicts inter-subject correlation in metabolite levels. One can imagine modelling it with the pairwise subject similarity in various personal attributes, e.g. with a multiple regression:
metabCorr = b0 + b1 * ageDiff + b2 * heightDiff + ... + e
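where metabCorr is the metabolite correlation for a pair of subjects, and ageDiff, heightDiff, etc. are the differences between the two subjects of that pair in the corresponding attributes.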
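The dataset behind such a regression has one row per pair of subjects. A rough sketch of how those rows could be assembled is below, in Python for illustration only. The helper `pairwise_rows`, the toy values, and the use of absolute differences are all assumptions, since the model above only names `ageDiff` and `heightDiff` without saying how they are computed.

```python
from itertools import combinations

def pairwise_rows(corr, attributes):
    """Build one regression row per unordered pair of subjects (i < j)."""
    rows = []
    for i, j in combinations(range(len(attributes)), 2):
        row = {"i": i, "j": j, "metabCorr": corr[i][j]}
        for name in attributes[i]:
            # Assumption: dissimilarity is the absolute difference.
            row[name + "Diff"] = abs(attributes[i][name] - attributes[j][name])
        rows.append(row)
    return rows

# Toy values for three hypothetical subjects.
attributes = [
    {"age": 30, "height": 170},
    {"age": 45, "height": 160},
    {"age": 52, "height": 182},
]
corr = [
    [1.0, 0.2, -0.1],
    [0.2, 1.0, 0.4],
    [-0.1, 0.4, 1.0],
]

rows = pairwise_rows(corr, attributes)
for row in rows:
    print(row)
```

With 1000 subjects this yields 499500 rows, and every subject contributes to 999 of them.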
Standard multiple regression software won't yield correct p-values for the regression coefficients, because it assumes that the rows of the dataset to which the model is fit are independent. Here each row relates to a pair of individuals, so each row is related to the many other rows that share one of those individuals. Perhaps the p-values would be correct if the degrees of freedom were adjusted, e.g. to the number of subjects (1000) minus the number of explanatory variables.
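To put numbers on the dependence, the short sketch below counts the rows in the pairwise dataset and how many other rows each row overlaps with through a shared subject. It is Python for illustration only, and the function name `pair_counts` is hypothetical.

```python
from math import comb

def pair_counts(n_subjects):
    """Return (number of pairs, rows per subject, rows sharing a subject with a given row)."""
    n_pairs = comb(n_subjects, 2)            # one row per unordered pair
    rows_per_subject = n_subjects - 1        # each subject pairs with everyone else
    overlapping_rows = 2 * (n_subjects - 2)  # other rows containing either member of a pair
    return n_pairs, rows_per_subject, overlapping_rows

print(pair_counts(1000))  # (499500, 999, 1996)
```

So the regression would see 499500 rows built from only 1000 independent subjects, which is why the default degrees of freedom look far too generous.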
Canonical correlation analysis seems to be the answer. However, I don't think I have enough data to accurately estimate linear combinations for both the metabolites and the personal attributes. I therefore thought to combine the metabolite level similarities into a single summary statistic, namely the correlation.
Furthermore, how can I do this modelling in R?
Any thoughts on this would be greatly appreciated.