Dear Muhammad, PCA is an ordination method in statistics: we use it to reduce the dimensionality of the data so that the result is simpler to compare and to interpret. The aim of PCA is to summarise the variables in fewer dimensions, so when several variables are defined by (load on) the same component, it means they are correlated. That is the main way of discussing components. If you want to interpret the components themselves, however, you are better off with factor analysis, which is a more flexible method and can be carried out with the PCA algorithm plus a rotation phase that makes the result easier to understand.
Attached is an illustration of how factor analysis works on data:
Fig. (a) is the result of the first step of factor analysis, which is based on PCA (up to this step everything is the same as in PCA).
Fig. (b) is the result after rotation. The variables are vegetation indices computed from the red, blue and near-infrared (NIR) bands of satellite imagery.
EVI and EVIm include the blue band in their formulas, as well as red and NIR, whereas NDVI, NDVIm, SR and SRm use only the red and NIR bands.
The first plot (a) tells us nothing about their formulas, but after rotation you can see that EVI and EVIm move close to the PC1 axis and the others close to the PC2 axis. Based on the second plot we can say that PC1 corresponds to the blue band and PC2 corresponds to the red and NIR bands.
There are several rotation methods; here the rotation is varimax, which is the most common one. As you can see, it simply rotates the axes in order to, first, lower the correlation between the components as much as possible and, second, assign the loadings so that each variable loads mostly on one component and not on both.
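If you want to reproduce this kind of before/after-rotation comparison yourself, here is a minimal Python sketch using scikit-learn's PCA and FactorAnalysis (the rotation="varimax" option needs scikit-learn 0.24 or newer). The data and the index names are random stand-ins for illustration only, not the actual satellite values behind the figures:

```python
# Minimal sketch: unrotated PCA loadings vs. varimax-rotated factor loadings.
# The data are random stand-ins, not real vegetation-index values.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
blue = rng.normal(size=n)        # hypothetical "blue band" driver
red_nir = rng.normal(size=n)     # hypothetical "red/NIR" driver

def noise():
    return 0.3 * rng.normal(size=n)

# Six synthetic index-like variables (names borrowed from the post, purely illustrative)
X = np.column_stack([
    blue + 0.4 * red_nir + noise(),   # "EVI"
    blue + 0.5 * red_nir + noise(),   # "EVIm"
    red_nir + noise(),                # "NDVI"
    red_nir + noise(),                # "NDVIm"
    red_nir + noise(),                # "SR"
    red_nir + noise(),                # "SRm"
])
Xz = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(Xz)
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(Xz)

print("PCA loadings (unrotated):\n", pca.components_.T.round(2))
print("Factor loadings (varimax):\n", fa.components_.T.round(2))
```

With data like these, the unrotated PCA loadings mix both drivers, while the varimax-rotated loadings tend to separate the blue-driven indices from the red/NIR-driven ones, which is the pattern described above.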
The best way to explain principal component analysis depends on your background.
If you come from mathematics or statistics:
Suppose C is a symmetric positive (semi)definite n x n matrix (in the statistical application this will be a covariance matrix). Then the principal components are simply an orthonormal basis of eigenvectors e_1, \ldots, e_n of C, ordered so that the corresponding eigenvalues are non-increasing: \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n.
Note that C is diagonal in the eigenbasis e_1, \ldots, e_n, and therefore
C = \sum_{i = 1}^n \lambda_i e_i e_i^t
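As a quick numerical check of this identity, here is a sketch with an arbitrary toy covariance matrix, using numpy's eigh:

```python
# Sketch: eigendecomposition of a covariance matrix and the identity C = sum_i lambda_i e_i e_i^t.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
C = np.cov(X, rowvar=False)                               # symmetric PSD covariance matrix

lam, E = np.linalg.eigh(C)        # eigh: ascending eigenvalues, orthonormal eigenvectors in columns
lam, E = lam[::-1], E[:, ::-1]    # reorder so that lambda_1 >= lambda_2 >= ...

# Rebuild C from its principal components: C = sum_i lambda_i e_i e_i^t
C_rebuilt = sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(len(lam)))
print(np.allclose(C, C_rebuilt))  # True
```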
Working in the eigenbasis it is easy to see that \lambda_1 can be determined from the covariance matrix as
\lambda_1 = \min\{ \lambda : ||Ce|| \le \lambda \ \text{for all } e \text{ with } ||e|| = 1 \},
which then determines the first principal component e_1 explicitly by solving
C e_1 = \lambda_1 e_1.
Likewise
\lambda_2 = \min\{ \lambda : ||Ce|| \le \lambda \ \text{for all } e \perp e_1 \text{ with } ||e|| = 1 \},
which determines e_2, and we can continue as
\lambda_3 = \min\{ \lambda : ||Ce|| \le \lambda \ \text{for all } e \perp e_1, e_2 \text{ with } ||e|| = 1 \},
etc.
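Here is a rough numerical sketch of this sequential characterisation, assuming the eigenvalues are distinct: power iteration finds e_1 and \lambda_1, and repeating it on C - \lambda_1 e_1 e_1^t (which kills the e_1 direction) finds e_2 and \lambda_2.

```python
# Sketch: recover lambda_1, e_1 and then lambda_2, e_2 by maximising ||Ce|| over unit vectors,
# restricting to the orthogonal complement of what was already found (power iteration + deflation).
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
C = A @ A.T                       # symmetric positive (semi)definite test matrix

def leading_component(M, iters=500):
    """Power iteration: leading eigenvalue/eigenvector of a symmetric PSD matrix."""
    e = rng.normal(size=M.shape[0])
    for _ in range(iters):
        e = M @ e
        e /= np.linalg.norm(e)
    return float(e @ M @ e), e    # Rayleigh quotient of the (unit-norm) limit vector

lam1, e1 = leading_component(C)
# Deflate: remove the e_1 direction, then the leading component of what is left is (lambda_2, e_2)
lam2, e2 = leading_component(C - lam1 * np.outer(e1, e1))

print(lam1, lam2)
print(np.linalg.eigvalsh(C)[::-1][:2])   # should agree up to numerical error
```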
This is the point of calling the eigenvectors the principal components: they pick up the main possible variation first, then the subleading variation, and so on. Another way to look at it is the following easy-to-prove fact.
Suppose that C' is a matrix of rank \le r such that ||C - C'||^2 = tr(C-C')^2 is minimal. Then
C' = \sum_{i = 1}^r \lambda_i e_ie_i^t
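A small numerical illustration of this best rank-r approximation property (a sketch on a toy matrix; any competing rank-r matrix should do at least as badly in Frobenius norm):

```python
# Sketch: the truncated spectral sum C' = sum_{i<=r} lambda_i e_i e_i^t beats an arbitrary
# rank-r competitor in Frobenius norm (Eckart-Young, specialised to symmetric PSD C).
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
C = A @ A.T

lam, E = np.linalg.eigh(C)
lam, E = lam[::-1], E[:, ::-1]

r = 2
C_best = (E[:, :r] * lam[:r]) @ E[:, :r].T     # sum_{i<=r} lambda_i e_i e_i^t

B = rng.normal(size=(6, r))
C_other = B @ B.T                               # some other matrix of rank <= r

err_best = np.linalg.norm(C - C_best, "fro")
err_other = np.linalg.norm(C - C_other, "fro")
print(err_best, err_other)                      # the truncated sum always wins (or ties)
```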
This means that if C = cov(X_1, \ldots, X_n) for some (possibly correlated) random variables X_1, \ldots, X_n, then the new random variables Y_i = e_i^t (X_1, \ldots, X_n)^t are mutually uncorrelated and var(Y_i) = \lambda_i. Moreover, if you have to choose just one variable that explains most of the (co)variance of X_1, \ldots, X_n, it will be Y_1. If you can choose the variable that explains most of the (co)variability of (X_1, \ldots, X_n) *not already explained by Y_1*, it will be Y_2, and so on. Note that the Y_i variables are determined by the data themselves, at least when the eigenvalues are all different.
If you have to choose just 2 variables to explain the (co)variance of (X_1, \ldots, X_n), then any two independent linear combinations Z_1, Z_2 of Y_1, Y_2 do equally well. This freedom is useful, e.g. if you want the Z_i to be as correlated as possible with as few of the X_i as possible and as independent of each other as possible. This is the difference between PCA and factor analysis. Note my weaselly wording: whereas PCA is a straightforward procedure, the precise procedure for choosing factors by "rotation" of the principal components (a true high-dimensional orthogonal transformation if you insist on independence of the Z_i) has considerably more leeway for fairly arbitrary choices. In particular, if you take one more factor (i.e. increase r to r+1), the first r principal components remain the same, *but the first r factors obtained from taking r or r+1 factors can be, and typically are, different*.
If you come from social science:
Suppose you have a dataset of 1000 questionnaires, each with the answers to 3 questions X, Y, Z on a 10-point scale (in reality we usually have more than 3 questions, but 3 simplifies the notation and lets us visualise what is going on).
How can we give a statistical description of the questionnaires?
The coarsest description is simply given by the means, i.e. the average answers, something like
\bar X = 3.57, \bar Y = 6.34, \bar Z = 7.23
The next level of statistical accuracy would be the standard deviations or the variances of X, Y and Z, but let us assume that the questions probe a single concept and are strongly correlated. Then we have to look at the covariance matrix:
cov(X,Y,Z) = \overline{ (X - \bar X, Y - \bar Y, Z - \bar Z)^t (X - \bar X, Y- \bar Y, Z - \bar Z)}
That is 9 numbers, and it would be 25 if we had asked 5 questions instead of 3. However, we assumed that there was an underlying concept that explains most of the underlying variability. One really efficient way to summarise the results would be a combination of the answer scores that gives us a measure of this underlying concept, together with a measure of its statistical fluctuation.
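For concreteness, here is a toy simulation of such a dataset in Python; all numbers are invented, with a single hidden "concept" driving the three answers:

```python
# Sketch: 1000 simulated questionnaires whose three answers X, Y, Z share one underlying concept.
import numpy as np

rng = np.random.default_rng(4)
concept = rng.normal(size=1000)    # the hidden trait of each respondent
X = np.clip(np.round(5.5 + 2.0 * concept + rng.normal(scale=0.7, size=1000)), 1, 10)
Y = np.clip(np.round(5.5 + 1.8 * concept + rng.normal(scale=0.7, size=1000)), 1, 10)
Z = np.clip(np.round(5.5 + 1.5 * concept + rng.normal(scale=0.7, size=1000)), 1, 10)

data = np.column_stack([X, Y, Z])                  # 1000 points in 3-dimensional space
print("means:", data.mean(axis=0))                 # the coarse description
print("covariance matrix (3 x 3 = 9 numbers):\n", np.cov(data, rowvar=False))
```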
Each questionnaire gives us a point in 3-dimensional space; e.g. a questionnaire with X = 3, Y = 7, Z = 8 gives the point (3, 7, 8).
The 1000 questionnaires thus give a cloud of 1000 points in 3-dimensional space. The mean M = (\bar X, \bar Y, \bar Z) is the "centre of mass" of this cloud. If the answers were perfectly correlated, all the points would lie on a line through the centre of mass. In reality this never happens, but we can still determine the line such that the sum of the squared distances of the points to the line is minimal. The unit-length direction vector e_1 = (a_1, b_1, c_1) of this line through the centre of mass is the first principal component. The combination
U_1 = a_1 (X - \bar X) + b_1 (Y - \bar Y) + c_1 (Z - \bar Z)
measures the main deviation of the points in the cloud from the mean, because M + U_1 e_1 is exactly the projection of a point in the cloud onto the line. The score U_1 has mean zero, and the variation of this score among the 1000 respondents explains more of the total variation of the X, Y, Z scores than any other linear combination of the X, Y, Z's (except that you could rescale U_1 to get the same information, but that is cheating).
The unit vector e_2 = (a_2, b_2, c_2) orthogonal to e_1, such that e_1 and e_2 span the plane with the least sum of squared distances to the point cloud, is the second principal component. The score
U_2 = a_2 (X - \bar X) + b_2 (Y - \bar Y) + c_2 (Z - \bar Z)
measures the main deviation of the points in the cloud from lying on the best-fitting line determined in the previous step. The U_2 score and the U_1 score are uncorrelated. The score U_2 has mean zero, and the variation of this score among the 1000 respondents explains more of the total variation of the X, Y, Z scores *not already explained by U_1* than any other linear combination of the X, Y, Z's (except that you could rescale U_2 to get the same information, but that is cheating).
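Continuing the toy simulation above, one can check these properties of the scores numerically (a sketch only): U_1 and U_2 have mean zero, variances equal to \lambda_1 and \lambda_2, and are uncorrelated.

```python
# Sketch: scores along the principal components have mean ~0, variances lambda_i, zero correlation.
import numpy as np

rng = np.random.default_rng(4)
concept = rng.normal(size=1000)
data = np.column_stack([
    np.clip(np.round(5.5 + 2.0 * concept + rng.normal(scale=0.7, size=1000)), 1, 10),  # X
    np.clip(np.round(5.5 + 1.8 * concept + rng.normal(scale=0.7, size=1000)), 1, 10),  # Y
    np.clip(np.round(5.5 + 1.5 * concept + rng.normal(scale=0.7, size=1000)), 1, 10),  # Z
])

C = np.cov(data, rowvar=False)
lam, E = np.linalg.eigh(C)
lam, E = lam[::-1], E[:, ::-1]                    # lambda_1 >= lambda_2 >= lambda_3

centred = data - data.mean(axis=0)                # subtract the centre of mass M
U = centred @ E                                   # column i holds the U_i scores

print("means of U_i:   ", U.mean(axis=0).round(6))            # ~0
print("variances of U_i:", U.var(axis=0, ddof=1).round(3))     # ~lambda_i
print("eigenvalues:     ", lam.round(3))
print("corr(U_1, U_2):  ", np.corrcoef(U[:, 0], U[:, 1])[0, 1].round(6))  # ~0
```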
We can continue approximating the point cloud by linear spaces of higher and higher dimension to obtain all the principal components, as many as we have dimensions. In our case that is 3 components, one for each question.
Remark 1. It so happens that mathematically the easiest way to determine the principal components is to solve the eigenvalue equation
cov \cdot e = \lambda e
for pairs of an eigenvalue \lambda and an eigenvector e (normalised to unit length), and to order the eigenvalues in decreasing (or, more precisely, non-increasing) order. One can then show that the variance of the U_i score is exactly \lambda_i.
Remark 2: The principal components, especially the higher ones, tend to be hard to interpret because they mix all the questions (with positive and negative signs) and subtract the parts "you have already explained". Thus it has become customary to just look at the space spanned by r principal components and then, in that total linear space, called a *factor*, take a basis and corresponding scores that correlate maximally with the scores of the original questions and minimally with each other. This is called rotation because it transforms the original principal component basis e_1, \ldots, e_r into something that is usually more sensible (it is a true high-dimensional rotation or orthogonal transformation if you insist on the factor scores being uncorrelated). Unfortunately, the coefficients of these scores (usually called the factor loadings) can and typically will change (in a notable way) if you increase the number of factors r, unlike the coefficients (loadings) of the scores defined by the principal components.
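A small sketch of this last point, using scikit-learn's FactorAnalysis with varimax rotation on invented data: the first two principal components are the same whether you extract 2 or 3 components, while the varimax-rotated loadings of the first two factors generally change when you go from 2 to 3 factors.

```python
# Sketch of Remark 2: the first r principal components do not change when you extract more,
# but varimax-rotated factor loadings typically do.  Toy data only.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
f = rng.normal(size=(800, 3))               # three hidden drivers
W = rng.normal(size=(3, 6))                 # loadings onto six observed variables
X = StandardScaler().fit_transform(f @ W + 0.3 * rng.normal(size=(800, 6)))

pca2, pca3 = PCA(n_components=2).fit(X), PCA(n_components=3).fit(X)
fa2 = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
fa3 = FactorAnalysis(n_components=3, rotation="varimax").fit(X)

# Principal components: identical (up to sign) whether 2 or 3 are extracted.
print(np.allclose(np.abs(pca2.components_), np.abs(pca3.components_[:2]), atol=1e-8))
# Rotated factor loadings: the first two rows generally differ between the 2- and 3-factor fits.
print(np.round(fa2.components_, 2))
print(np.round(fa3.components_[:2], 2))
```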
Dr Ehsan has illustrated this well and presented the data in a nice manner. May I ask Dr Ehsan: what is the difference between PCA and RDA? Can PCA be performed in SigmaPlot, SPSS or R software? How could a biologist interpret the data in a research paper? Will you please elucidate with some examples?
As far as my limited knowledge goes, we mainly discuss the first two or three principal components (PCs), which play the central role in the overall variance. We discuss the role of these components and their correlation with other important parameters. For more detailed information, please have a look at the attached papers. Maybe my answer can be a little helpful. Thank you.