PCA is used to reduce dimensionality; the question is how to decide how many components are enough to represent the data. Looking at the change in the cumulative sum of the eigenvalues sorted in descending order has been suggested. Is there any other way to do that?
The following publication is a good lead that covers most of the conventional stopping-rule techniques:
Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches by Donald A. Jackson, Ecology, Vol. 74, No. 8 (Dec., 1993), pp. 2204-2214
There are a few heuristic approaches to this end. In my experience, the one most used in machine learning is the one you cite: choose a total amount of variance to be preserved, and fine-tune it using a cross-validation scheme. Another popular method is the so-called Kaiser-Guttman method, i.e., keeping only the factors with corresponding eigenvalues > 1 (however, this has a few theoretical shortcomings). A famous paper that compared the most common approaches at the time is "Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches" by Jackson; a newer one (with a lot more methods) is "How many principal components? Stopping rules for determining the number of non-trivial axes revisited" by Peres-Neto et al. This last one goes into a lot of detail about each method, so I think it is the best answer to your question.
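To make the cross-validation idea concrete, here is a minimal sketch, assuming scikit-learn, a labelled toy dataset and a logistic-regression classifier (none of which are prescribed above); it tunes the preserved-variance threshold inside a pipeline:

```python
# Hedged sketch: tune the fraction of variance kept by PCA via cross-validation.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)   # placeholder labelled data

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# With 0 < n_components < 1, PCA keeps enough components to reach that
# fraction of the total variance; the candidate thresholds are arbitrary.
grid = {"pca__n_components": [0.70, 0.80, 0.90, 0.95]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print("best variance threshold:", search.best_params_)
```

The threshold that generalises best for your downstream task is then the one to keep, rather than a fixed 80% or 90%.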
Dimensionality reduction using PCA is a kind of feature extraction approach. In my experience, if the cumulative sum of the eigenvalues covers 80% or 90% of the total variance, the transformed vectors will be enough to represent the original vectors. However, it is sometimes very difficult to reduce the number of dimensions dramatically when the original data dimensionality is very large. In any case, the first transformed feature, the one with the largest eigenvalue, has the greatest discriminative ability. You should note that all the features produced by PCA are new features, which may have meanings different from the original features.
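For illustration, a small sketch of that cumulative-sum rule, assuming scikit-learn and a placeholder random data matrix (the 90% threshold is just the figure mentioned above):

```python
# Hedged sketch: pick the number of PCs that reach 90% of the total variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))          # placeholder data matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)      # cumulative variance
n_components = int(np.searchsorted(cumvar, 0.90) + 1)  # first index reaching 90%
print("PCs needed for 90% of the variance:", n_components)
```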
Apart from feature extraction, another approach, which not only keeps the original meaning of the features but also reduces the feature dimensionality, is feature selection. There are many different ways of doing that as well, such as mRMR, correlation-based methods, and so on.
I suggest using hybrid approaches to cope with the dimensionality reduction problem.
As you write yourself, you can use the sorted eigenvalues to know how many PCs contain N% of the variation of the data. The problem is that it is not a priori clear what the correct value of N is. Is 50% enough? Or do you need at least 90%? That depends on your application, so your question is impossible to answer without further information about the intended use. However, there is one thing you can/should always do if possible: just make the plots. Make scatterplots of your data with PCn versus PCm along the axes, for many n and m, and look. If you understand the data, then typically you're done and the question is implicitly answered. Aim for understanding, not numbers.
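If it helps, here is a rough sketch of the "just make the plots" advice, assuming matplotlib and scikit-learn with placeholder data:

```python
# Hedged sketch: pairwise scatterplots of the first few PC scores.
import itertools
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                 # placeholder data

scores = PCA(n_components=4).fit_transform(X)  # scores on the first 4 PCs

fig, axes = plt.subplots(2, 3, figsize=(9, 6))
for ax, (n, m) in zip(axes.ravel(), itertools.combinations(range(4), 2)):
    ax.scatter(scores[:, n], scores[:, m], s=5)
    ax.set_xlabel(f"PC{n + 1}")
    ax.set_ylabel(f"PC{m + 1}")
fig.tight_layout()
plt.show()
```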
One way to obtain an appropriate number of PCs, as suggested by Velicer (1976), is the minimum average partial (MAP) method.
Implementation: the first PC is partialled out of the correlation matrix and the average of the squared off-diagonal partial correlations is computed. In the next step the first two PCs are partialled out and the same computation is repeated, and so on; after N steps the first N PCs have been removed from the correlation matrix. The number of components at which the average squared partial correlation reaches its minimum is the number of PCs to retain.
Please see the paper for its implementation: Velicer, W. F. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3), 321-327.
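For illustration only, a minimal numpy sketch of the MAP criterion as described above (this follows the common formulation, not Velicer's original code; the data matrix is a placeholder):

```python
# Hedged sketch of Velicer's MAP criterion: average squared partial
# correlation after removing the first m components, minimised over m.
import numpy as np

def velicer_map(X):
    """Return the number of components suggested by the MAP test."""
    R = np.corrcoef(X, rowvar=False)                 # correlation matrix
    p = R.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0, None))

    avg_sq = []
    for m in range(1, p):                            # partial out first m PCs
        C = R - loadings[:, :m] @ loadings[:, :m].T  # partial covariance
        d = np.sqrt(np.diag(C))
        partial = C / np.outer(d, d)                 # partial correlations
        off_diag = partial[~np.eye(p, dtype=bool)]
        avg_sq.append(np.mean(off_diag ** 2))
    return int(np.argmin(avg_sq) + 1)                # m with the minimum

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                        # placeholder data
print("MAP suggests retaining", velicer_map(X), "components")
```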
You may use validated procedures like parallel analysis and Velicer's minimum average partial (MAP) test. A good article on that is: O'Connor, B. (2000). SPSS and SAS programs for determining the number of components using parallel analysis and Velicer's MAP test. Behavior Research Methods, Instruments, & Computers, 396-402.
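As a rough illustration of the parallel-analysis part (plain numpy rather than O'Connor's SPSS/SAS programs; the data matrix and the 95th-percentile cut-off are assumptions for the example):

```python
# Hedged sketch of Horn's parallel analysis: keep components whose eigenvalues
# exceed those obtained from random data of the same shape.
import numpy as np

def parallel_analysis(X, n_iter=100, percentile=95, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    rand = np.empty((n_iter, p))
    for i in range(n_iter):
        Z = rng.normal(size=(n, p))                  # random data, same shape
        rand[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    threshold = np.percentile(rand, percentile, axis=0)
    keep = 0
    for o, t in zip(obs, threshold):                 # retain until an eigenvalue
        if o > t:                                    # drops below the random one
            keep += 1
        else:
            break
    return keep

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 12))                       # placeholder data
print("parallel analysis retains", parallel_analysis(X), "components")
```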
It is intended precisely for these situations. It has the classical solutions by Jackson, Peres-Neto, Dijksterhuis, Manly, Jolliffe and others, together with algorithms of my own. Download it for free, together with the Matlab scripts I wrote to implement everything. Not only do you get to choose the significant PCs, you can also choose the significant loadings within each PC, with the option of setting the non-significant loadings to 0. When you do this you will end up with PCs close to an orthogonal factor rotation. This will greatly improve the power of any test you may perform on the extracted z-scores.
In the answers below you find a lot of crap. In particular, when someone tells you to use PCs up to a predetermined cumulative amount of variation, eigenvalues bigger than 1, or any other threshold. An example: if you randomly generate your data you will still inevitably find eigenvalues bigger than one, and you will be able to rank eigenvalues until a cumulative sum of 80%; but it means nothing! Other crap is to use bootstrap techniques. That is a class of randomization methods developed to estimate confidence intervals, NOT to test hypotheses. They do not even set a null hypothesis.
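You can check the first point in a few lines (pure-noise data, numpy assumed):

```python
# Hedged check: purely random data still yields several eigenvalues > 1,
# so the eigenvalue-greater-than-1 rule on its own proves nothing.
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 20))   # pure noise
eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
print("eigenvalues > 1 in random data:", int(np.sum(eig > 1)))
```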
In Computational Ecology and Software you can also find two more articles I published in 2013 comparing permutation tests and bootstrap techniques applied to RMA (they work equally well for PCA).
1) When you have many more variables than observations, your smallest eigenvalues will tend to be infinitesimal, and the numerical algorithms for estimating them start having precision problems. Yet this is a false problem with a very simple solution: transpose your original data matrix. Observations become variables and variables become observations. At first it may feel unnatural, but once you understand what happened and start interpreting the results you will arrive at the same conclusions! And that is what matters; the mathematics and statistics are just means to an end (see the numerical check after point 2).
Which takes us to the next point:
2) Do not mistake precision and accuracy for dogma, particularly concerning the significance (alpha) level. An example: suppose you were developing a new treatment that would cure cancer, or Alzheimer's, or whatever, and your test yielded an alpha of 5.1%. What would you do? Would you give up on your research? I hope not! An alpha level is just the probability of being wrong when taking a decision or making a statement, and sometimes it might be worth taking the risk of being 5.1% or 6% wrong. Mathematics and machines are only very helpful tools to support a decision, but whether it is worth taking (or not) always has to be a human decision. And the best human to make it is the one with a lifelong experience of working on that subject.
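As a quick numerical check of point 1 (numpy assumed; this only illustrates the eigenvalue side of the transposition argument), the nonzero eigenvalues are the same whether you work with the large p x p matrix or the much smaller n x n one:

```python
# Hedged check: with p >> n, the nonzero eigenvalues of the p x p covariance
# matrix coincide with those of the n x n matrix from the transposed problem.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 500                                   # far more variables than observations
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                          # centre the columns

big = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1][:n]    # p x p problem
small = np.linalg.eigvalsh(Xc @ Xc.T / (n - 1))[::-1][:n]  # n x n problem
print(np.allclose(big[:n - 1], small[:n - 1]))   # True: same nonzero spectrum
```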
Hope this helps... and check the books by Manly and by Jackson (you can find them in my bibliography).
The number of PCA components you'll need depends on the initial number N of dimensions of the experimental space and on the error variance (usually reported by the available PCA software) that you are willing to accept in order to obtain a reduced dimension M lower than the initial N (M < N).
It depends on many factors: the dimension of the data itself, how the variables are correlated, the distribution of the data, and what application you want to use the data for.
N.B. I am not a specialist in this domain, but from reading some articles I came to the above conclusion.
The article suggested by Parminder Singh Reel above is a very good one. The updated version is "How many principal components? Stopping rules for determining the number of non-trivial axes revisited" (Peres-Neto et al., 2005).
It depends on the face database you have selected. Suppose that for a particular face database the maximum recognition rates at dimensions 112x3, 112x4 and 112x5 are 98%, 97% and 98% respectively;
then you should select 112x3.
For example, take two face databases with the same sample size of 100 per class and 100 classes, i.e., 10,000 faces in total, with face images of size 50x50. For one database you may get the maximum recognition rate at 50x3, for the other at 50x4. This is because of the nature of the captured face images: location, illumination, foreground and background, occlusion, etc.
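In other words, among the dimensions you tried, keep the smallest one that reaches the best recognition rate. A tiny sketch (the rates are just the example figures above):

```python
# Hedged sketch: pick the smallest dimension achieving the best recognition rate.
rates = {"112x3": 0.98, "112x4": 0.97, "112x5": 0.98}   # example figures only
best_rate = max(rates.values())
choice = min((dim for dim, r in rates.items() if r == best_rate),
             key=lambda d: int(d.split("x")[1]))
print("selected dimension:", choice)   # -> 112x3
```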