How do I determine the right number of PCs to use in data analysis?

Hi Ekene,

This depends totally on the application.

1. If you are doing PCA as a preprocessing step for supervised learning, then the optimal number of PCA dimensions should be chosen by cross-validation. I am a fan of 5-fold cross-validation repeated five times.

2. If you are using PCA as an unsupervised method to explore and visualize the data then several options exist:

-a. As Clément suggested, a fixed threshold on the cumulative variance explained, such as 80% or even 95%.

-b. construct a scree plot: variance explained (or eigenvalues) ~ number of dimensions. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow.

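-c. Kaiser criterion: drop all components with eigenvalues under 1.0. This rule assumes the PCA was run on the correlation matrix (standardized variables), where each original variable contributes a variance of 1.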

-d. Horn's parallel analysis, which I am a fan of. Horn's method contrasts the observed eigenvalues with those produced through a PCA on a number of random data sets of uncorrelated variables, with the same number of variables and observations as the experimental or observational data set, to produce eigenvalues for components that are adjusted for the inflation induced by sampling error. Components with adjusted eigenvalues greater than one are retained.
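To make the procedure in (d) concrete, here is a minimal Python sketch using only NumPy. It is an illustration under stated assumptions, not a reference implementation: the function names, the 200 simulations, the use of the mean random eigenvalue as the bias estimate, and the synthetic two-factor data are all my own choices. It also prints the Kaiser count from (c) for comparison.

```python
import numpy as np

def correlation_eigenvalues(X):
    """Eigenvalues of the correlation matrix of X, sorted largest first."""
    corr = np.corrcoef(X, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def parallel_analysis(X, n_sims=200, seed=0):
    """Return (n_retained, observed, adjusted) eigenvalues.

    n_sims and the use of the mean (rather than a high percentile)
    of the random eigenvalues are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_obs, n_vars = X.shape
    observed = correlation_eigenvalues(X)
    # Eigenvalues from random, uncorrelated data of the same shape.
    random_eigs = np.array([
        correlation_eigenvalues(rng.standard_normal((n_obs, n_vars)))
        for _ in range(n_sims)
    ])
    # Mean random eigenvalue minus 1 estimates the sampling-error inflation.
    adjusted = observed - (random_eigs.mean(axis=0) - 1.0)
    retain = adjusted > 1.0
    # Count leading components only: stop at the first one that fails.
    n_retained = n_vars if retain.all() else int(np.argmin(retain))
    return n_retained, observed, adjusted

# Synthetic example (assumption): 8 variables driven by 2 latent factors.
rng = np.random.default_rng(42)
latent = rng.standard_normal((300, 2))
loadings = np.zeros((2, 8))
loadings[0, :4] = 1.0   # factor 1 drives variables 0-3
loadings[1, 4:] = 1.0   # factor 2 drives variables 4-7
X = latent @ loadings + 0.5 * rng.standard_normal((300, 8))

n_keep, observed, adjusted = parallel_analysis(X)
kaiser_count = int((observed > 1.0).sum())
print("parallel analysis retains:", n_keep)
print("Kaiser rule retains:", kaiser_count)
```

On real data the two rules can disagree, because the Kaiser cutoff ignores the inflation that parallel analysis corrects for.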

More detail: http://pdxscholar.library.pdx.edu/commhealth_fac/27/

Here is a decent post on how to perform it in R: https://www.r-bloggers.com/determining-the-number-of-factors-with-parallel-analysis-in-r/
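Going back to case 1 (PCA as preprocessing for supervised learning), the sketch below shows one way the cross-validated choice could look in Python with NumPy only. It assumes a regression target and ordinary least squares on the component scores (principal component regression); the function name, the synthetic data, and the squared-error metric are assumptions for illustration. The key point is that scaling and PCA are refit on each training fold, so the held-out fold never leaks into the preprocessing.

```python
import numpy as np

def pcr_cv_error(X, y, n_components, n_splits=5, n_repeats=5, seed=0):
    """Mean squared error of PCA + linear regression under repeated k-fold CV."""
    rng = np.random.default_rng(seed)  # same seed, so same folds for every k
    n = len(y)
    errors = []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n), n_splits)
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False
            # Fit centering, scaling and PCA on the training fold only.
            mu = X[train].mean(axis=0)
            sd = X[train].std(axis=0)
            Z_train = (X[train] - mu) / sd
            Z_test = (X[test_idx] - mu) / sd
            _, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
            W = Vt[:n_components].T  # loadings of the leading components
            # Least squares on the component scores, with an intercept.
            A_train = np.column_stack([np.ones(train.sum()), Z_train @ W])
            coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
            A_test = np.column_stack([np.ones(len(test_idx)), Z_test @ W])
            errors.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return float(np.mean(errors))

# Synthetic example (assumption): 10 variables, 3 latent factors drive y.
rng = np.random.default_rng(1)
latent = rng.standard_normal((200, 3))
loadings = np.zeros((3, 10))
loadings[0, :4] = 1.0
loadings[1, 4:7] = 1.0
loadings[2, 7:] = 1.0
X = latent @ loadings + 0.3 * rng.standard_normal((200, 10))
y = latent.sum(axis=1) + 0.5 * rng.standard_normal(200)

cv_errors = {k: pcr_cv_error(X, y, k) for k in range(1, 11)}
best_k = min(cv_errors, key=cv_errors.get)
print("number of PCs with lowest CV error:", best_k)
```

For a classification problem the same loop applies with a classifier and a suitable metric in place of least squares and squared error.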

regards,

Milan

Georges Kogge Kome

Hello Ekene, remember that when using PCA, your objective is to reduce the number of variables, or better still to "group" the variables into a smaller number of components such that the loss of information is minimal. The PCs to retain are those that together explain a very high share of the variability. For example, if the first three PCs explain more than 80% of the variation, then consider three PCs. If it takes four PCs to reach that level, then consider four. However, given that you have up to 25 variables, try to limit the PCs to a maximum of four or five, provided they explain the variability to a very high extent (say > 80%).
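The cumulative-variance rule described above can be sketched in a few lines of Python with NumPy. This is only an illustration: the function name, the standardization step, and the synthetic 25-variable data set are assumptions, and the 80% threshold is the one suggested in this thread.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.80):
    """Smallest number of PCs whose cumulative explained variance reaches threshold."""
    # Standardize so every variable contributes equally (correlation-matrix PCA).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    singular_values = np.linalg.svd(Z, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # First position where the cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic example (assumption): 25 variables driven by 3 latent factors.
rng = np.random.default_rng(7)
latent = rng.standard_normal((200, 3))
loadings = np.zeros((3, 25))
loadings[0, :9] = 1.0
loadings[1, 9:17] = 1.0
loadings[2, 17:] = 1.0
X = latent @ loadings + 0.4 * rng.standard_normal((200, 25))

k, cumulative = n_components_for_variance(X, threshold=0.80)
print("PCs needed for 80% of the variance:", k)
print("cumulative share for the first five PCs:", np.round(cumulative[:5], 3))
```

If the returned number exceeds the four or five PCs suggested above, that is a sign the variance is spread too evenly for a small set of components to summarize the data well.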
