I would like to use multivariate statistical methods on my data from a lake. How can I use principal components analysis and cluster analysis to understand the possible sources of the chemical proxies? A detailed step-by-step would be appreciated.
Cluster analysis will group together "similar" samples. You will need quite a few variables to get good, representative clusters.
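If you want to see what that grouping looks like in practice, here is a minimal Python sketch using hierarchical (Ward) clustering. The file name and column layout are hypothetical stand-ins for your own lake data, so adjust them to fit:

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Rows = samples, columns = chemical proxies (hypothetical file name)
data = pd.read_csv("lake_chemistry.csv", index_col=0)

# Standardise each proxy so variables on large scales don't dominate
z = (data - data.mean()) / data.std(ddof=0)

# Ward linkage tends to give compact, interpretable groups
tree = linkage(z, method="ward")

# Cut the tree into, say, 4 groups and inspect which samples fall together
labels = fcluster(tree, t=4, criterion="maxclust")
print(pd.Series(labels, index=data.index).sort_values())
```

The choice of 4 groups is arbitrary here; in practice you would inspect the dendrogram and pick a cut that makes sense for your lake.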
I've taken a pair of classes on multivariate stats and several more that "taught" PCA. My multivariate stats profs spent all of an hour, combined, on PCA; they didn't see much use for it. If PCA is done properly, you take a bunch of highly correlated IVs and create PCs for regression analysis. PCs are fairly difficult to explain because they are percentages of the original IVs. They are also orthogonal vectors. If the PCs you created do not have a VIF or tolerance of 1.0, you did your PCA wrong. Having done PCA on data like yours, I have never come up with anything valuable using it.
The number of PCs you come up with tends to be the number of variables that are significant and have a low VIF. So, if you come up with 3 PCs from your data, you can reduce your original data set to 2-4 variables. Since PCA is supposed to be a "Model Reduction" technique, removing insignificant terms from your model and removing significant terms from your model with a high VIF will do the same thing.
I keep mentioning using PCA properly. The reason is that a lot of the "textbooks" written by non-statisticians use PCA improperly, then proclaim how great it is. To create PCs properly, you need to normalize the IVs first. Then create your PCs.
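To make the "normalise first, then create your PCs" workflow concrete, here is a small Python sketch using scikit-learn, with synthetic data standing in for real proxies. Because PCs are orthogonal, their VIFs should come out as 1.0, which is the sanity check mentioned above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # stand-in for your proxy matrix

Xz = StandardScaler().fit_transform(X)  # normalise the IVs first
pca = PCA().fit(Xz)
scores = pca.transform(Xz)              # the principal components

# PCs are orthogonal, so their correlation matrix is ~identity and the
# VIF of each PC (diagonal of the inverse correlation matrix) is ~1.0
corr = np.corrcoef(scores, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(vif)                              # should all be ~1.0
print(pca.explained_variance_ratio_)    # share of variance per PC
```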
Good luck though; you might be able to come up with something of value. Hope it works out well for you!
One of my projects is working on exactly this type of problem, so this is going to look like a bit of a plug for my own work - I apologise for that. I'm not exactly tracking pollution sources in lakes, but doing general cluster analysis of atmospheric chemistry data, and also collaborating on using my techniques for object tracking.
I do not use any PCA or prior analysis of the data other than normalising (for offline) or standardising (for some online) data. I largely agree with Andrew's comments above: PCA is useful if you need to limit the data you use for analysis. I prefer to go the other way and use as many dimensions of data as possible, so I am working towards dimensionally independent clustering.
Your key considerations are what clusters you want to create and how.
Do you want to find regions of the lake where pollution is similar, or do you want to try to trace a path of the pollution?
I would imagine that the pollution chemistry disperses somewhat as it moves away from its source, so you need to deal with data drift. It will also not move around in nice ellipsoidal or spherical bubbles, correct?
This rules out pretty much all single-pass clustering techniques, as they will only find those styles of groups.
So, two-pass clustering techniques are the way to go. With these you find 'micro-clusters' of spherical or ellipsoidal shape, and these are 'joined' in various ways to create arbitrarily shaped groups. Probably the most popular of these are DBSCAN, DENCLUE and variants. However, these can slow dramatically with multi-dimensional data.
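As a toy illustration of why this matters, here is a short sketch with scikit-learn's DBSCAN; the two synthetic crescents are my stand-in for non-ellipsoidal pollution plumes, which a single-pass method like k-means would split incorrectly:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Two interleaved crescents: arbitrary shapes no ellipsoid can capture
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Dense micro-regions are grown and chained into arbitrarily shaped clusters
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(
    StandardScaler().fit_transform(X))
print(np.unique(labels))   # -1 = noise; 0 and 1 = the two crescents
```

With real lake data you would stack the x, y (and z) sample positions alongside the standardised chemistry as the feature columns.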
Next week I will be presenting a new technique, CODAS (Clustering of Online Data in Arbitrary Shapes), at CYBCONF 2015 in Poland. This is an online technique that clusters data as it arrives. It's pretty much dimensionally independent, so you can use as many input chemicals as you like, and of course use your x, y, z spatial co-ordinates to effectively track the clusters. It does not forget any of the cluster regions, so it could be used to build a path of the pollution, but with no time component.
This has been tested on real climate data from an airborne data campaign and successfully identified anomalies in the data, in flight and in real time, that other techniques would miss.
I also have, in the testing phase, CEDAS, where the 'E' is for 'evolving': it evolves the clusters over time, i.e. it will 'forget' clusters that have not received data for a period of time. This is exactly for tasks such as tracking clusters of chemistry over time. For example, if some pollution were released today and drifted across your lake, then at any point in time you could see where the pollution is. You could save this data and have a historical record of where it was.
I will be taking the key elements of CODAS and implementing them into my DDC technique, which is an offline method, so that it will also produce arbitrary shapes and be largely dimensionally independent.
You can see some demo videos on my Lancaster University web page. Particularly relevant to tracking pollution would be the second video on CEDAS. There you can see the whole dataset plotted 'post collection' with the data being tracked as it arrives.
Over the next few weeks I will be tidying up code and releasing it under the GNU public licence so you will be free to try it.
I will release code for Matlab and Octave, and eventually 'dll' files for use with C and/or C++.
While I agree with you in a general sense about the need for caution in applying and interpreting principal components analysis, you seem to have gotten a couple of facts wrong about the method.
First, you say, "PCs are fairly difficult to explain because they are percentages of the original IVs." In fact, principal components are LINEAR COMBINATIONS of the original independent variables, not percentages of them. Where percentages come in is that the principal components, by definition, correspond to the eigenvectors of the covariance matrix of those independent variables. Put more simply, principal components are the linear combinations of the original independent variables that account for the highest possible percentages of the VARIANCE of those variables. So the first principal component is the single linear combination of those variables that has the highest variance, the second component the next highest (subject to being orthogonal to the first), and so on.
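For the record, here is a small numpy sketch of exactly that definition, on made-up correlated variables: the eigenvectors of the covariance matrix supply the linear-combination weights, and the eigenvalues give each component's share of the total variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic, deliberately correlated IVs (illustrative only)
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])
Xc = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]     # largest variance first
weights = evecs[:, order]           # columns = linear-combination weights
pcs = Xc @ weights                  # the principal components themselves
print(evals[order] / evals.sum())   # percentage of total variance per PC
```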
You also note that, "The number of PCs you come up with tends to be the number of variables that are significant and have a low VIF." And, relatedly, "Since PCA is supposed to be a 'Model Reduction' technique, removing insignificant terms from your model and removing significant terms from your model with a high VIF will do the same thing."
These are both factually false. There is NO systematic relationship between the number of principal components and the number of "significant" variables in a model, except perhaps in very specific, trivial cases. And, similarly, simply removing insignificant terms from the model is not even close to accomplishing the same goal as a principal components approach. Treating them as the same will give you very misleading results. Remember, principal components are linear combinations of random variables, optimized to maximize the variance of each component.
With the linear combinations of the variables: if you normalize the variables before you run PCA, the factor loadings tend not to have a dominant variable. Thus the loadings are complicated combinations that ARE difficult to explain.
If you don't normalize the data first, the PCs are not orthogonal, which kind of defeats the purpose of PCA. Further, the factor loadings in PCA are dependent upon the range of values of the original variables. So, if you have a system with pH (5-9), Voltage (1 V-2 V) and Mass (1 g-10 g), PC1 will be dominated by Mass, PC2 by pH and PC3 by Voltage. If I change Voltage to # electrons, all of a sudden PC1 is dominated by # electrons. If I change mass from grams to nanograms, mass dominates PC1 again. If I change pH to micromoles of H+, pH dominates PC1. Which gets back to my statement about just reducing the terms in the model for statistical significance and minimizing VIF.
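That unit-dependence is easy to demonstrate numerically. The sketch below (made-up data matching the pH/voltage/mass ranges above) extracts the first PC's loadings from the raw covariance matrix and shows how rescaling a single variable reshuffles which one dominates:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
ph     = rng.uniform(5, 9, n)    # pH, range ~4
volts  = rng.uniform(1, 2, n)    # voltage in V, range ~1
mass_g = rng.uniform(1, 10, n)   # mass in grams, range ~9

def pc1_loadings(*cols):
    """First PC's loadings from the raw (unscaled) covariance matrix."""
    X = np.column_stack(cols)
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    return evecs[:, -1]          # eigenvector with the largest eigenvalue

print(pc1_loadings(ph, volts, mass_g))         # mass owns PC1
print(pc1_loadings(ph, volts, mass_g * 1e9))   # mass in ng: even more so
print(pc1_loadings(ph * 100, volts, mass_g))   # inflate pH's scale: pH takes over
```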
In every data set I have dealt with where PCA was a suggested method of analysis, I found that reducing the model as I suggested led to a model with the same number of terms as PCA. The difference is, if someone asks "what happens when......?", I can answer it directly, without transforming the new data point into factor loadings, etc. I'm sure there are uses for PCA. My stats profs don't feel there is. Several practicing statisticians I have talked with haven't found a use for PCA either. It looks good on paper when done improperly. When done properly, it's usually not that good.