I am doing PCA on a dataset of 9 variables. How many components can I retrieve from these variables? Do I have a choice to have components 'of my choice'? I am using SPSS software for this.
Just to add a detail that is surely given in the chapter that Ehsan pointed to: the components are already sorted from the one with the highest eigenvalue, i.e. the highest explanatory power, to the one with the lowest. So it makes sense to include eigenvalues "from left to right", rather than just picking one of your choice, because a component in the middle does not explain nearly as much as the "first" one - which is easily visible in the mentioned scree plot. Also, you get as many components as you enter variables; thus, in your case, 9.
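If it helps, here is a minimal R sketch (with simulated data, since I don't have your dataset) showing that a PCA of 9 variables returns 9 components, already sorted by decreasing eigenvalue:

```r
# Simulated data: 200 observations of 9 variables (illustrative only).
set.seed(1)
X <- matrix(rnorm(200 * 9), nrow = 200, ncol = 9)

# scale. = TRUE analyses the correlation matrix (standardised variables).
pc <- prcomp(X, scale. = TRUE)

length(pc$sdev)   # 9 components - as many as there are variables
pc$sdev^2         # eigenvalues, sorted from largest to smallest
summary(pc)       # proportion and cumulative proportion of variance explained
```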
As was mentioned, here is an example with ordered eigenvalues:
Component   Eigenvalue   % of variance   Cumulative % of variance
Comp 1      3.203        40.036           40.036
Comp 2      1.761        22.016           62.052
Comp 3      1.172        14.655           76.707
Comp 4      0.821        10.261           86.968
Comp 5      0.471         5.884           92.852
Comp 6      0.244         3.053           95.905
Comp 7      0.220         2.751           98.657
Comp 8      0.107         1.343          100.000
1- Eigenvalue > 1 ==> keep Comp 1 to Comp 3.
2- % of variance > ratio, where ratio = (total % of variance = 100%) / (number of active variables used in the PCA, here 8) = 12.5
   ==> we keep the components whose % of variance is greater than 12.5
   ==> this confirms rule 1.
3- Scree plot of the eigenvalues (look for the elbow).
4- Cumulative % of variance of about 70%
   ==> in this example, 3 components explain about 76.7% of the variance.
(See the R sketch after this list for one way to apply these rules.)
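A small R sketch (illustrative data and names) applying the four rules above to the eigenvalues of a PCA:

```r
# Illustrative data: 100 observations of 8 active variables, as in the table above.
set.seed(1)
X <- matrix(rnorm(100 * 8), nrow = 100, ncol = 8)

pc   <- prcomp(X, scale. = TRUE)
eig  <- pc$sdev^2                 # eigenvalues
pvar <- 100 * eig / sum(eig)      # % of variance per component
cvar <- cumsum(pvar)              # cumulative % of variance

which(eig > 1)                    # rule 1: eigenvalue > 1 (Kaiser criterion)
which(pvar > 100 / length(eig))   # rule 2: % of variance > 100 / number of variables
plot(eig, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot")         # rule 3: look for the elbow
which(cvar >= 70)[1]              # rule 4: number of components needed to reach ~70%
```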
Additional Remark:
Sometimes, when the data contain an outcome variable that we want to explain with the reduced dimensions, we can declare this outcome as a supplementary quantitative variable, so that it does not participate in determining the new principal axes.
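For example, with the FactoMineR package in R (a sketch; the data-frame name and the column index of the outcome are placeholders for your own data):

```r
library(FactoMineR)

# Suppose 'mydata' (placeholder) has the outcome in column 10 and the active
# variables in columns 1:9. quanti.sup = 10 projects the outcome onto the axes
# afterwards, but it plays no role in building the principal components themselves.
res <- PCA(mydata, quanti.sup = 10, graph = FALSE)

res$quanti.sup$coord   # coordinates (correlations) of the outcome on the PCs
```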
One of the best ways of addressing this is through Krzanowski's cross-validation approach: Eastment, H. T., and W. J. Krzanowski. "Cross-Validatory Choice of the Number of Components From a Principal Component Analysis." Technometrics 24.1 (1982): 73-77. The cross-validation is based on successively deleting the row and column for each cell in the matrix of observations, constructing a singular value decomposition from the reduced matrix, and projecting into the reduced space of k dimensions. The between-point distances are compared in the reduced space and the full space. The authors use updating formulae for rank-1 changes to the SVD, which give a very efficient algorithm.
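For what it is worth, here is a naive (and deliberately unoptimised) R sketch of that cross-validatory idea; the function name is mine, and the published algorithm additionally handles the sign indeterminacy of the singular vectors and uses the rank-1 SVD updates mentioned above to make this efficient:

```r
# Naive leave-one-out PRESS for choosing the number of components k (illustrative only).
# For each cell x_ij: an SVD without row i supplies the right singular vectors,
# an SVD without column j supplies the left ones; combining them predicts x_ij.
pca_press <- function(X, k) {
  X <- scale(X, center = TRUE, scale = FALSE)
  n <- nrow(X); p <- ncol(X)
  press <- 0
  for (j in 1:p) {
    svd_j <- svd(X[, -j, drop = FALSE])      # SVD with column j deleted
    for (i in 1:n) {
      svd_i <- svd(X[-i, , drop = FALSE])    # SVD with row i deleted
      pred  <- sum(svd_j$u[i, 1:k] * sqrt(svd_j$d[1:k]) *
                   sqrt(svd_i$d[1:k]) * svd_i$v[j, 1:k])
      press <- press + (X[i, j] - pred)^2
    }
  }
  press
}

# Compare candidate numbers of components on a built-in data set:
# sapply(1:3, function(k) pca_press(as.matrix(USArrests), k))
```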
Another thing: suppose I have 9 variables. I want a component of variables no. 2, 4 and 6 together, specifically. Can I choose only these three variables 'manually' to form a component? And can this be compared with another component 'of my choice'?
No good reason comes to mind for doing this, other than wilfully distorting the picture that the analysis presents to you. While variable 2 might still have some explanatory power, variable 4 will have only a fraction, and variable 6 yet a smaller fraction - so they are debris compared to variables one and two, just dust particles. But standing in front of a painting you look at the Mona Lisa, not at the dust particles covering the painting. There may be very few questions where you are more interested in the dust than in the colourful painting, and you would have to name those questions and reasons in order to convince people. In general, you apply methods such as PCA to help you reduce the multidimensionality to something that you can handle - you don't use these methods to give you the picture you want to see in the mud. It is of course possible to abuse statistics in that sense, but it's not ethical in my mind. If you are so interested in what variables 2, 4 and 6 MIGHT tell you, then it would be best to design a new experiment testing appropriate questions regarding these variables and analyse the new data.
I agree with Susanne, and I'll show with an example that choosing variables to build principal components (PCs) follows some statistical rules, so you need to look at the contribution of the variables in building the new axes (PCs). In this practical example you can use the dimdesc function, which selects the variables contributing to a PC together with a significance probability.
Here I take just the first PC (PC1) as an example, showing the following result ordered by p-value.
You can see the importance of each variable's contribution from its large absolute correlation with the PC, which corresponds to a small p-value (the table is ordered by increasing p-value).
Variable       Correlation   p-value
Points          0.9561543    0.000000000
100m           -0.7747198    0.000000003
110m.hurdle    -0.7462453    0.000000021
Long.jump       0.7418997    0.000000028
400m           -0.6796099    0.000001028
Rank           -0.6705104    0.000001616
Shot.put        0.6225026    0.000013883
High.jump       0.5719453    0.000093623
Discus          0.5524665    0.000180222
Remark: you can choose to keep the variables most highly correlated with PC1, or fix a threshold (for example |correlation| > 0.70) to select the variables that are well represented on PC1.
AN : I attached the R code file to reproduce the results for this example
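The correlations above look like they come from the decathlon example data that ships with FactoMineR; assuming that, here is a minimal sketch to reproduce this kind of output:

```r
library(FactoMineR)

data(decathlon)   # 41 athletes: 10 event results plus Rank, Points and Competition

# Active variables: the 10 events. Rank and Points (columns 11:12) are supplementary
# quantitative variables, Competition (column 13) is supplementary qualitative.
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

res.pca$eig                                    # eigenvalues, % of variance, cumulative %
desc <- dimdesc(res.pca, axes = 1, proba = 0.05)
desc$Dim.1$quanti                              # correlations and p-values for PC1
```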
PCA is generally used as a tool of exploratory data analysis / data visualisation. However, you can also think of it as a statistical model. You have p variables and you ask the question: can they all be represented as linear combinations of a much smaller number, say k, of variables (factors!), up to some additive random noise, uncorrelated over the p variables *and of the same variance for each variable*? Then, in terms of the p x p covariance matrix of the variables you are studying, your model is: Sigma = Lambda Phi Lambda^T + sigma^2 I, where Lambda is the p x k matrix of coefficients in these linear representations, Phi is the k x k covariance matrix of the underlying factors, sigma^2 is a small positive number, and I is the p x p identity matrix.
The usual PCA analysis actually estimates this model, which would usually be thought of as belonging to Factor Analysis.
There is a huge indeterminacy problem with this model: if you insert a k x k non-singular matrix and its inverse twice (the second time transposed) between the three factors in Lambda Phi Lambda^T, you get another representation of the same form. PCA has a mathematical criterion by which to pick one particular representation, but this need not be the most relevant to the scientific field of interest. The indeterminacy of the results can also be turned to your advantage: you are completely free to look for a particular representation where one of the factors has a large influence on three of the original variables and a small influence on the others.
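A quick numerical illustration of that indeterminacy (all names and numbers below are made up): insert a non-singular k x k matrix A and its inverse between the factors, and the implied covariance matrix does not change.

```r
set.seed(1)
p <- 9; k <- 3
Lambda <- matrix(rnorm(p * k), p, k)               # p x k loading matrix
Phi    <- crossprod(matrix(rnorm(k * k), k, k))    # k x k factor covariance matrix
A      <- matrix(rnorm(k * k), k, k)               # any non-singular k x k matrix

Sigma1  <- Lambda %*% Phi %*% t(Lambda)
Lambda2 <- Lambda %*% A                            # transformed loadings
Phi2    <- solve(A) %*% Phi %*% t(solve(A))        # transformed factor covariance
Sigma2  <- Lambda2 %*% Phi2 %*% t(Lambda2)

max(abs(Sigma1 - Sigma2))   # numerically zero: same implied Sigma, different parameters
```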
I completely agree with Richard and completely disagree with Susanne. PCA is usually used for exploration, and it could reasonably be the case that you want to see how the eigenstructure changes as a result of dropping one or more variables. This is not unethical, because there is no 'right' solution for principal components analysis as it is usually practised.
As Richard says, you can use it as an approximation to a common factor analysis, but such models are not identified (only down to a rotation of the eigenvectors). Much work has been done on choosing rotations that make the transformed eigenvectors easier to interpret: for example, varimax and orthomax rotations. You can even use Procrustes rotations, which rotate into maximum conformity with a target.
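For example, in R (a sketch using the built-in USArrests data and the stats::varimax function; Procrustes rotation is available in packages such as vegan):

```r
# Varimax rotation of the first two principal-component loadings (illustrative).
pc <- prcomp(USArrests, scale. = TRUE)
k  <- 2

raw_loadings <- pc$rotation[, 1:k] %*% diag(pc$sdev[1:k])  # eigenvectors scaled by sdev
rot <- varimax(raw_loadings)

rot$loadings   # rotated loadings, often easier to interpret
rot$rotmat     # the orthogonal rotation that was applied
```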
There is nothing God-given about the PCA solution for summarising a set of variables with a low-dimensional summary. It happens to choose linear combinations of the variables that maximise the variance subject to orthonormality constraints - but this criterion is pretty arbitrary. Other approaches, such as projection pursuit (Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput., 23.) or independent component analysis (ICA), work by maximising the negative entropy (negentropy) of the distribution of the components. They are computationally more intensive, but can often be more informative. Maximising the negentropy can be justified as generating a component with a distribution which is, in some sense, maximally different from normal. You are free to choose any optimality criterion you want, and then do the maths to generate a solution. The advantage of PCA and ICA is that the maths is already done and the code is already freely available.
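As an illustration of the ICA alternative (a sketch on simulated data; it assumes the fastICA package is installed):

```r
library(fastICA)

set.seed(42)
S <- cbind(runif(500), rexp(500))    # two non-Gaussian source signals
A <- matrix(c(1, 1, -1, 2), 2, 2)    # mixing matrix
X <- S %*% A                         # observed mixtures

ica <- fastICA(X, n.comp = 2)        # maximises an approximation to negentropy
pca <- prcomp(X)                     # variance-maximising components, for comparison

head(ica$S)                          # estimated independent components
head(pca$x)                          # principal component scores
```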
The point is, you are free to explore your data to generate hypotheses. Now, Susanne *does* have a point in that it is certainly possible to select the transformation and technique so as to approximate the answer that you want to get. It all depends on where you are in the research process. If you are at a stage where you have well-formulated models of the system and you want to test them, you probably shouldn't be using principal components analysis at all. Instead you should be looking at LISREL-type models (Jöreskog, K. (1973). A General Method for Estimating a Linear Structural Equation System. In Goldberger, A. and Duncan, O., editors, Structural Equation Models in the Social Sciences. Academic Press.).
If you are not at that stage, then you are free to explore - the ethical constraint is not to try to represent an exploratory hypothesis generating study as a definitive hypothesis testing study.
Ideally, if the amount of total sample variance explained is greater than 80%, that could be satisfactory. However, this method is subjective, so it depends on the researcher's experimental requirements. The scree plot can also be used to determine an appropriate number of PCs: we look for an elbow in the scree plot, after which the remaining eigenvalues are relatively small and all about the same size. Finally, one could use the eigenvalue-greater-than-one rule, in which PCs with eigenvalues greater than one are retained for interpretation.
It depends on the face database you have selected. Suppose that for a particular face database you get maximum recognition rates of 98% for dimension 112x3, 97% for 112x4 and 98% for 112x5;
then you should select 112x3.
For example, take two face databases with the same sample size, 100 per class and 100 classes, so 10,000 faces in total, each face image of size 50x50. For one database you get the maximum recognition rate at 50x3, for the other at 50x4. This is because of the nature of the captured face images: location, illumination, foreground and background, occlusion, etc.