I am looking at the applicability of Principal Component Analysis to hydrogeochemical data sets displayed as time series, and I would like to know your opinions.
Well, PCA and other eigenvector techniques are usually applied in the space domain, but I believe there is nothing wrong with applying them in the time domain, i.e. to several samples collected at the same place at various times and analyzed for various chemical compounds. I have never seen it in the literature or practiced it myself, but I see nothing wrong with it.
Following on from my comment, I believe it is also true that applying PCA in the time domain to hydrochemical data requires, as a fundamental condition, that the processes explaining the chemical composition of the water are not at steady state or, put another way, that they are kinetically active. Otherwise, the system variance will not grow as time goes on and applying PCA may yield nothing useful. Does this make any sense?
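In case a sketch helps, here is roughly what time-domain PCA looks like with scikit-learn. Everything below is synthetic and assumed for illustration (the well, the number of sampling dates, and the analyte count are made up): rows are sampling dates at one location, columns are analytes, and a slow drift is injected so the system is not at steady state and the variance PCA needs actually exists.

```python
# Sketch: PCA in the time domain at a single sampling point.
# Rows = sampling dates, columns = measured analytes (all synthetic).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_dates, n_analytes = 60, 5            # e.g. monthly samples, 5 solutes

# A slow shared trend plus noise: the chemistry is kinetically active,
# so the variance grows over the record instead of staying flat.
trend = np.linspace(0, 1, n_dates)[:, None] * rng.normal(size=(1, n_analytes))
X = trend + 0.1 * rng.normal(size=(n_dates, n_analytes))

X_std = StandardScaler().fit_transform(X)   # standardize each analyte
pca = PCA().fit(X_std)

# How much of the temporal variance each PC captures.
print(pca.explained_variance_ratio_)
```

If the system were at steady state, the trend term would vanish and the leading PC would capture little more than measurement noise, which is the point made above.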
That is a good way to see it. It covers a single sampling point with various time steps. Below I expand your view to multiple points.
Suppose the kinetic view is extended so that a hydrochemical data set for a single time is collected from different points, and some of the governing mechanisms that explain its variability are assumed to be reactive or kinetic. Consider the two opposite outcomes. In the first, PCA works because the variance changes, which means the kinetic view holds. In the opposite situation it does not work, because the variance does not change in any reliable sense, and the kinetic view should be dropped. Therefore PCA is not sensitive to the time domain when handling multiple points.
This generalizes my comment to multiple points and times.
Just out of curiosity, why would you use PCA on groundwater data?
If you have too many parameters in your model and a high degree of correlation among them, you can usually combine different variables. For example, you might have [Ca], [Na], [other metals], [water hardness], ... You can combine most of those into the water hardness variable.
Best of all, you end up with variables that you can actually describe to others, rather than PC 1, PC 2, etc.
In each case, water hardness is naturally expected to rise with extra major ions. I call the first case "articulated reactions" (1) and the second "superposition" (2). PCA has been applied to both (1) and (2). By using PCA, a researcher tries to elucidate the interrelations among the chemical variables, in other words which one is the governing mechanism: the articulated, the superimposed, or a hybrid of the two (3).
From a statistical point of view, this method allows reducing dimensionality (the chemical parameters are represented by factor scores) and examining the variance among parameters through those factor scores.
Thank you for your contribution. Could you let me know how to calculate non-confounded PCA scores in software, and add your opinion on the available software suites?
Even if you use PCA properly, the factor loadings/scores for each PC become a mangled mess, which makes describing what had an effect difficult. Yes, you do minimize the number of variables in your regression model. But how do you describe the effect of increased nitrogen loading from chemical fertilizers using PC 1, PC 2, etc.?
Having worked in aquatic chemistry, I know that there are a lot of redundant variables in the chemical analyses and in the models they want you to build. You can usually do a better job by simply combining similar, highly correlated variables together (which is what PCA does). Then, if you want to know what will happen when the value of 'X' changes, you can go back to your model, make that change, and see what happens. Using PCA, you need to go back and create a new data point, turn that into the PC scores for each PC you keep, ...
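For what it's worth, the round trip described above looks something like this in scikit-learn (the data and the 1.5x change are made up for illustration): to see the effect of changing one raw variable under a PCA-based model, the new point has to be re-scaled and re-projected with the fitted scaler and PCA before the model can use it.

```python
# Sketch: projecting a modified data point back into PC space.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))           # 50 samples, 3 variables (synthetic)

scaler = StandardScaler().fit(X)
pca = PCA(n_components=2).fit(scaler.transform(X))   # keep 2 PCs

# "Change the value of X": bump one raw variable on one sample...
x_new = X[0].copy()
x_new[1] *= 1.5

# ...then re-project it through the same scaler and PCA to get scores.
pc_scores = pca.transform(scaler.transform(x_new.reshape(1, -1)))
print(pc_scores)                       # the point expressed in the kept PCs
```

With plain combined variables you would just edit the value and re-run the model; here every what-if pass needs this extra transform step.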
I've taken a pair of courses in multivariate statistical analysis. Proper PCA analysis can yield some interesting results, graphically. Try explaining them, though. In my classes as well as in professional work, I have never found PCA to be that useful. My professors also taught us that PCA and similar methods should be a last resort, not a first, because of how difficult they are to work with.
With every data set I have ever worked with where we supposedly should use PCA, I found that eliminating insignificant terms, then significant terms with a high VIF, yields the same number of variables as I would get PCs in PCA. So, I avoid PCA whenever possible.
If you do not use PCA properly, then your PCA is complete garbage. I've read enough "stats" textbooks by non-statisticians to know that a lot of people in the sciences consistently misuse PCA. Do you normalize your data before using PCA? The answer should be yes. If not, you will get garbage results, and if you follow a few of the books I read, you could end up naming each PC after one of your original variables.
Since PCA loadings are based on the variability of the variables, if you have three variables, Conc (0.1 ppm to 0.5 ppm), Voltage (1 V to 5 V), and Temp (20 °C to 50 °C), your first PC will be mostly Temp, the second PC mostly Voltage, and the third PC mostly Conc. If you go back and express conc in ppb, its range and variability are larger, so now PC 1 will be mostly conc. If you change voltage to units of number of electrons, PC 1 will be mostly voltage.
One way to tell if your software is doing PCA properly is to run a regression using the PCs and look at the collinearity/VIF. True PCs are orthogonal, so the VIF for each PC is 1.00 and the correlation between each pair of PCs is exactly 0. If you end up with different values, the PCA software is not doing the job it should be doing.
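A quick sanity check in that spirit, using scikit-learn on a synthetic correlated data set (all values here are made up): the correlation matrix of the PC scores should be the identity up to floating-point error, which implies VIF = 1 / (1 - R^2) = 1.0 for every PC.

```python
# Sketch: verifying that PCA scores come out orthogonal.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated data

scores = PCA().fit_transform(StandardScaler().fit_transform(X))
corr = np.corrcoef(scores, rowvar=False)

# Largest deviation from the identity matrix; should be ~1e-15.
print(np.max(np.abs(corr - np.eye(corr.shape[1]))))
```

If this deviation is not vanishingly small, the "PCA" output is not a set of true principal components.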