Is it essential to use a correlation matrix when the scales of variables are different in PCA?

The answer is: you really do not have any other choice, as the covariance matrix is not the correct entity to use. See the Naik and Khattree article in The American Statistician (1994 or 1995). So it is not about "essential", since you are not required to do PCA at all; there could be other methods.
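To illustrate why the covariance matrix is problematic when scales differ, here is a minimal sketch in Python with NumPy. The data are synthetic and purely hypothetical (two correlated variables, one about 1000 times larger than the other); the helper name `leading_pc` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: two correlated variables on very different scales.
shared = rng.standard_normal(n)
x = shared + 0.5 * rng.standard_normal(n)             # scale of order 1
y = 1000.0 * (shared + 0.5 * rng.standard_normal(n))  # scale of order 1000
data = np.column_stack([x, y])


def leading_pc(matrix):
    """Return the unit eigenvector with the largest eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # ascending order
    vector = eigenvectors[:, -1]
    # Fix the arbitrary sign so the largest component is positive.
    return vector * np.sign(vector[np.argmax(np.abs(vector))])


pc_cov = leading_pc(np.cov(data, rowvar=False))
pc_corr = leading_pc(np.corrcoef(data, rowvar=False))

print("leading PC, covariance matrix: ", pc_cov)
print("leading PC, correlation matrix:", pc_corr)
```

With the covariance matrix, the leading component points almost entirely along the large-scale variable; with the correlation matrix, both variables get equal weight.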

I have not read your full question; my answer is based on your main title. But your first line struck me as indicating that it all may be a non-issue. If you have just two variables and use the correlation matrix as the entity for PC analysis, then the two PCs are x+y and x-y (for standardized x and y, up to scaling; x+y comes first when the correlation is positive). You need no data at all, as the PC vectors do not depend on the data! Only the variance explained by each PC does.
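The two-variable claim can be checked numerically. The sketch below (Python with NumPy; the function name `pcs_of_2x2_correlation` is hypothetical) computes the eigenvectors of a 2x2 correlation matrix for several values of the correlation r.

```python
import numpy as np


def pcs_of_2x2_correlation(r):
    """Eigenvalues (descending) and eigenvectors of [[1, r], [r, 1]]."""
    corr = np.array([[1.0, r], [r, 1.0]])
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


for r in (0.2, 0.9, -0.5):
    eigenvalues, eigenvectors = pcs_of_2x2_correlation(r)
    print(f"r = {r:+.1f}  eigenvalues = {eigenvalues}")
    print(np.abs(eigenvectors))  # every entry has magnitude 1/sqrt(2)
```

For any nonzero r the eigenvalues are 1 + |r| and 1 - |r|, and the eigenvectors are always proportional to (1, 1) and (1, -1); r only decides which of the two comes first.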

Sagar Parajuli

Could you read my full question and then answer? Although there are only two variables, there are multiple grid points.

Oleksiy (Alex) Chadyuk

With admittedly non-existent knowledge of climatology, I can only comment on a number of purely statistical issues with your analysis. First, you are clearly looking at time series; applying straightforward correlation to them violates its assumption of independent measurements. It may be the accepted approach in the literature -- so I really hope it is justified.

The fact that covariance and correlation produce different results tells me that there is something odd in your models for A and B, because correlation is merely covariance divided by a constant (the product of the two standard deviations). Check the normality of the distributions of your variables A and B obtained with both methods. My guess would be that your covariance method produced a major deviation from normality, which throws off your final analysis (because the assumption of normality is then violated). Your correlation method, I would expect, introduces less bias.

Finally, there is clearly something strange in how the two variables are built. You first make up two variables with no scientific justification for including or excluding the underlying data, using an apparently arbitrary formula dictated by a convenient measurement method rather than by theory. You then try to correlate them while apparently violating the mathematical assumptions of correlation itself. I am sorry to be so hard on you, but believe me, all kinds of mischief can result from an analysis like that! Again, I may cry foul as much as I like, but the literature in your field should guide you on what is good or bad in your analysis.

If I were you, I would just take your raw data and put it into a multivariate regression (perhaps after transforming the data into increments from the previous measurement -- Markov-chain style): the first 12 columns as independent variables and the last 12 columns as dependent variables.
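A minimal sketch of that suggestion in Python with NumPy follows. The array `raw` is a hypothetical stand-in for the real data (200 time steps, 24 columns of random walks), and ordinary least squares via `np.linalg.lstsq` is one possible way to fit the regression. It reports only the fit quality; significance testing would need an additional step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the raw data: 200 time steps, 24 columns.
raw = np.cumsum(rng.standard_normal((200, 24)), axis=0)

# Increments from the previous measurement (first differences).
increments = np.diff(raw, axis=0)

X = increments[:, :12]   # first 12 columns: independent variables
Y = increments[:, 12:]   # last 12 columns: dependent variables

# Add an intercept column and fit all 12 responses at once.
X1 = np.column_stack([np.ones(len(X)), X])
coef, _, _, _ = np.linalg.lstsq(X1, Y, rcond=None)

# In-sample R^2 for each dependent column.
fitted = X1 @ coef
ss_res = ((Y - fitted) ** 2).sum(axis=0)
ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
r_squared = 1.0 - ss_res / ss_tot

print("coefficient matrix shape:", coef.shape)
print("R^2 per dependent column:", np.round(r_squared, 3))
```

Because the stand-in columns are independent random walks, the R^2 values here are small; with real data they would show how much of each dependent column the increments of the independent columns explain.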

A statistically significant result from an analysis like that would make me much more confident. First, you do not select your data columns arbitrarily: if they do not predict, or are not predicted, you still have to explain that somehow. Whatever variables are significant will be combined in the regression model based on their actual contribution (partial correlation, not covariance, if we are being technical), rather than on arbitrary formulas.

And one more thing. Check the normality of the distribution of each of your raw variables. Many instruments measure logarithms of the underlying variables, simply because the measurement method works that way. Transform each of your variables until you reach reasonable normality of distribution (paying attention to the sign of the resulting variable).
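One rough way to see whether a transformation helps is to compare sample skewness before and after it. The sketch below uses Python with NumPy on hypothetical, synthetic log-normal data; `sample_skewness` is a helper written for this example, and skewness is only one crude indicator of normality, not a formal test.

```python
import numpy as np


def sample_skewness(values):
    """Third standardized moment; close to 0 for symmetric data."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


rng = np.random.default_rng(2)

# Hypothetical variable that is log-normal, hence strongly right-skewed.
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

skew_raw = sample_skewness(raw)
# The logarithm is only defined for strictly positive values.
skew_log = sample_skewness(np.log(raw))

print("skewness before transform:", skew_raw)
print("skewness after log:       ", skew_log)
```

The logarithm fits here because the synthetic data are positive and right-skewed; a variable with a different shape or sign would need a different transformation.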

Sagar Parajuli

Hi Alex, thanks for the many good points. My dependent variable shows a non-normal distribution (skewed left). A logarithmic transformation would help, but some independent variables are non-linearly related to the dependent variable, so I am not sure it would help to transform all the variables to make them normal, as you suggested.

In my current analysis, if the covariance matrix produced a major deviation from normality, that could actually be desirable, provided the predicted distribution resembles that of the original data. My problem is a bit unusual in that the outliers in the dependent variable are highly important, and the predictive model should capture them as well as possible.

Using PCA, I am not making up two variables; I am actually trying to find the time-series signal that each independent variable (selected on a physical basis) shares with the dependent variable. The observational data are not perfect for my regression analysis, so I must process them to extract the useful signal. I am certainly exploring a number of possible ways, and I have tried both combining the 12 columns and treating them separately.

Oleksiy (Alex) Chadyuk

Outliers are always important, but it is the likelihood of their appearing that really matters for regression. Regression assumes that the likelihood of an outlier appearing in the data is a certain nonlinear function of its distance from the mean. That function is, you guessed it, the normal curve. So if your data are not normally distributed (as they should be according to the CLT, because you are studying an additive molecular phenomenon with N close to infinity), then you are fooling the regression -- and eventually yourself.

So the first thing you have to do (after you have cleaned the data of outliers that are not supposed to be there, like somebody kicking your instrument while it was measuring) is to bring all your variables to normality. If the underlying theory of your phenomenon describes a non-linear relationship between independent and dependent variables, you have to apply THAT transformation before you put your data into the regression, because regression looks ONLY for linear relationships among normally distributed variables. You have to help it by figuring out the transformations and getting the data normal before you feed them to the regression. This is what the generalized linear model method is about, in a nutshell.

Now, if you are dealing with a time series beyond one or two cycles, you have to look for harmonics with a Fourier transform and then look for correlations between the variables' harmonics, and then it all gets very complicated. But I guess this is not the case with your data, so you just need to make sure that your transformation deals with autocorrelation in your data (i.e. when your measurement outcome is predicted non-trivially by the previous several measurements). Look at how this issue was approached in previous studies in your field for ideas on how to do that.
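As a rough illustration of what autocorrelation looks like, the sketch below (Python with NumPy) estimates the lag-1 autocorrelation of a synthetic series. The AR(1) series with coefficient 0.8 and the helper `autocorrelation` are assumptions made for this example, not part of the data under discussion.

```python
import numpy as np


def autocorrelation(series, lag):
    """Sample autocorrelation at a positive integer lag."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()
    return (centered[lag:] * centered[:-lag]).sum() / (centered ** 2).sum()


rng = np.random.default_rng(3)
noise = rng.standard_normal(2000)

# Hypothetical autocorrelated series: each value depends on the previous one.
ar1 = np.empty_like(noise)
ar1[0] = noise[0]
for t in range(1, len(noise)):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

acf_ar1 = autocorrelation(ar1, 1)
acf_noise = autocorrelation(noise, 1)

print("lag-1 autocorrelation, AR(1) series:", acf_ar1)
print("lag-1 autocorrelation, white noise: ", acf_noise)
```

A value far from zero at small lags, as for the AR(1) series here, signals that successive measurements are not independent and that the transformation or model has to account for it.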
