I've got 16 sites (groundwater, streams, wetlands) and I'm using PCA to help discriminate groupings of the sites based on their water quality (26 variables).
First issue: 3 of the 16 sites have much higher readings for conductivity, alkalinity, etc., so only 2 of the variables are normally distributed before transformation. After log or square root transformation, only 8 of the 26 variables can be considered normally distributed (taking either a Shapiro-Wilk W > 0.85 or p > 0.05 as the criterion). Do I have to remove these vastly different sites, or remove the worst of the non-normal variables? Any ideas on how either choice would affect the resulting PCA?
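To make the screening rule concrete, here is a minimal sketch of it, assuming Python with NumPy and SciPy. The readings are made up for illustration (13 ordinary sites plus 3 extreme ones) and are not my data, and `screen_normality` is just a hypothetical helper name.

```python
import numpy as np
from scipy import stats


def screen_normality(values, w_min=0.85, alpha=0.05):
    """Return (W, p, passes) for one variable.

    A variable "passes" if the Shapiro-Wilk W exceeds w_min
    or the p-value exceeds alpha (the rule described above).
    """
    w, p = stats.shapiro(values)
    return float(w), float(p), bool(w > w_min or p > alpha)


# Made-up conductivity-like readings: 13 ordinary sites and 3 extreme ones.
raw = np.array([55, 60, 62, 70, 74, 80, 85, 90, 96, 101, 110, 120, 130,
                2500, 3100, 4200], dtype=float)

# Compare the untransformed variable with its log and square-root versions.
for label, x in [("raw", raw), ("log10", np.log10(raw)), ("sqrt", np.sqrt(raw))]:
    w, p, ok = screen_normality(x)
    print(f"{label:>5}: W={w:.3f}  p={p:.4f}  passes={ok}")
```

With a cluster of 3 extreme sites the variable is closer to bimodal than skewed, which may be why the transformations rescue so few of my variables.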
Second issue: Conductivity is highly correlated with a number of other variables (Total Alkalinity, Hardness, Total dissolved solids, Total dissolved ions, Calcium, Sodium, Chloride, Manganese, Magnesium). Leaving all of these highly correlated variables in will heavily weight the PCA in favour of PC1 (which is dominated by all of these variables), because PCA maximises the variance explained. Has anyone got any references that discuss this and give advice on how to approach it?
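The effect I mean can be reproduced with a minimal sketch on synthetic data, assuming Python with NumPy. Everything here is invented for illustration: 10 variables that all follow one shared factor (standing in for the conductivity-related group) and 16 unrelated variables, for 16 sites, with the PCA done on the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_block, n_other = 16, 10, 16

# Ten variables that all track one shared factor (plus a little noise)...
factor = rng.normal(size=(n_sites, 1))
block = factor + 0.2 * rng.normal(size=(n_sites, n_block))
# ...and sixteen variables unrelated to the factor and to each other.
other = rng.normal(size=(n_sites, n_other))
X = np.hstack([block, other])

# PCA on the correlation matrix (i.e. on standardised variables).
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance taken by PC1, and how much of the PC1
# loading vector (unit length) sits on the correlated block.
pc1_share = eigvals[0] / eigvals.sum()
block_weight = np.sum(eigvecs[:n_block, 0] ** 2)

print(f"PC1 share of total variance: {pc1_share:.2f}")
print(f"Fraction of PC1 squared loadings on the correlated block: {block_weight:.2f}")
```

In this toy setup the correlated group effectively claims PC1 for itself, simply because it contributes ten near-copies of the same signal.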
Thanks
Andrew