How important is transforming the data before PCA?

08 August 2014

Before PCA analysis, it is recommended to transform your variables

Hi Abdallah,

The aim of data pretreatment (transformation and preprocessing) before PCA or other multivariate analyses is to mathematically remove sources of unwanted variation. These variations cannot be removed naturally during the data analysis itself.

Sincerely

Abdallah Samy

Thanks Acácio. It is a good answer and it helps a lot. Do you work on medicinal plants?

Best regards

AMS

Dominique Desbois

Principal Component Analysis (PCA) is a method based on the spectral analysis (eigendecomposition) of the matrix of linear correlation coefficients. The principal components are linear combinations of the original variables of the data table analyzed. This descriptive method was developed to detect linear relations between variables.
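As a minimal sketch of that description, assuming NumPy and a small synthetic data table (the function name `pca_correlation` and the data are illustrative, not from any particular package), PCA can be written as an eigendecomposition of the correlation matrix, with the components obtained as linear combinations of the standardized variables:

```python
import numpy as np

def pca_correlation(X):
    """PCA via eigendecomposition of the correlation matrix (rows of X = observations)."""
    R = np.corrcoef(X, rowvar=False)           # p x p matrix of linear correlations
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]          # reorder by decreasing explained variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
    scores = Z @ eigvecs                       # components = linear combinations of variables
    return eigvals, eigvecs, scores

# Synthetic table: two linearly related variables plus one independent variable
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200), rng.normal(size=200)])

eigvals, loadings, scores = pca_correlation(X)
print("share of variance per component:", (eigvals / eigvals.sum()).round(3))
```

The eigenvalues sum to the number of variables, and the component scores are mutually uncorrelated.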

However, if the relationships between the variables analyzed are not linear, the values of correlation coefficients can be lower. Thus, it is sometimes useful to transform the original variables prior to the Principal Component Analysis to "linearize" these relationships.
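To illustrate, here is a small sketch with made-up data, assuming an exponential relationship with multiplicative noise (both the relationship and the parameter values are hypothetical). The Pearson correlation is noticeably lower on the raw scale than after a log transform of the response:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 5.0, size=500)
# Hypothetical nonlinear relationship: y grows exponentially with x
y = np.exp(1.5 * x) * rng.lognormal(sigma=0.2, size=500)

r_raw = np.corrcoef(x, y)[0, 1]            # linear correlation on the original scale
r_log = np.corrcoef(x, np.log(y))[0, 1]    # linear correlation after "linearizing" y

print(f"Pearson r with raw y: {r_raw:.3f}")
print(f"Pearson r with log y: {r_log:.3f}")
```

Since PCA works from these correlation coefficients, the log-transformed variable would contribute a much cleaner linear structure to the analysis.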

One can of course use standard transformations (logarithm, power, Box-Cox) to linearize, thus approaching the multivariate normal model. Note also that these data transformations change the metric used: Correspondence Analysis is a principal component analysis of the table of conditional distributions, where the distance between profiles is the chi-square distance on count or frequency data. Many other linear (e.g., discriminant analysis, rank-PCA) or curvilinear (logarithm, Box-Cox transform) methods can be characterized as an appropriate processing of the data table corresponding to a particular metric (e.g., Mahalanobis metric, Spearman correlation coefficient).

However, these transformations are not always optimal, so it is advisable to use piecewise polynomials (splines), which approximate regular functions better than the overly rigid polynomial functions. These PCA methods are called spline-based PCA (B-splines for positive coefficients) when each variable is transformed independently, or Kernel PCA when the whole data table is transformed.

For an introduction to this type of spline analyses, see:

Models for Multivariate Data Analysis 1 Introduction, by PC Besse

www.math.univ-toulouse.fr/~besse/pub/mda.ps

and also:

Insight of a Dreamed PCA, by P. C. Besse.

www.math.univ-toulouse.fr/~besse/pub/dreampca.ps

A more complete treatment of the Spline based PCA is provided by:

PCA-based Dimension Reduction for Splines, by A. van der Linde

at:

http://www.math.uni-bremen.de/~avdl/download/papers/pca_net.PDF

Abdallah Samy

Hello Michael;

I agree, but you must use the log transformation with a lot of caution. Sometimes you do not have to transform your data.

Andrew Ekstrom

If you have an experiment where the ranges of values differ significantly between variables, say voltage, concentration, and pH, normalizing the data before PCA is essential. If you rescale a voltage of 1.0 V to 2.0 V to number of electrons, PC 1 will be number of electrons. If your pH is rescaled to [H+], it will be PC 2. If concentration is scaled to number of molecules, it will become the new PC 1. If you normalize your data, the scale does not matter. The PCs will be what they should be.
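A rough sketch of this effect, assuming NumPy and invented measurements (the helper `first_pc`, the latent factor, and all numbers are hypothetical). To keep it simple, it uses a plain unit change, volts to millivolts, instead of a conversion to number of electrons; standardizing each variable undoes this kind of linear rescaling:

```python
import numpy as np

def first_pc(X, standardize):
    """Absolute loadings of PC 1 (rows of X = observations)."""
    Xc = X - X.mean(axis=0)
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)        # unit variance for every variable
    cov = np.cov(Xc, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)           # eigenvalues come back in ascending order
    return np.abs(eigvecs[:, -1])              # last column = largest eigenvalue = PC 1

rng = np.random.default_rng(2)
n = 300
latent = rng.normal(size=n)                    # hypothetical common driver
voltage = 1.5 + 0.20 * latent + rng.normal(scale=0.05, size=n)   # volts
conc = 0.3 + 0.05 * latent + rng.normal(scale=0.02, size=n)      # mol/L
pH = 7.0 - 0.80 * latent + rng.normal(scale=0.50, size=n)

X_volts = np.column_stack([voltage, conc, pH])
X_millivolts = np.column_stack([voltage * 1000.0, conc, pH])     # same data, new unit

for label, X in [("volts", X_volts), ("millivolts", X_millivolts)]:
    print(label, "raw:", first_pc(X, False).round(3),
          "standardized:", first_pc(X, True).round(3))
```

On the raw data, PC 1 is dominated by whichever variable has the largest numerical variance, so it moves from pH to voltage when the unit changes. On the standardized data, the PC 1 loadings are the same for both versions.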

Some claim that normalizing certain data is not needed because the variability is about the same for all of the variables. If this is true, PCA will give essentially the same results whether you normalize or not. So normalizing your data will have no effect on the result.
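A quick check of this point, sketched with simulated data (the variables and the helper `pc1` are hypothetical). If all variables had exactly the same variance s², the covariance matrix would equal s² times the correlation matrix, so both would share the same eigenvectors. With variances that are only approximately equal, the directions should agree closely:

```python
import numpy as np

def pc1(matrix):
    """Leading eigenvector of a symmetric matrix (eigh sorts eigenvalues ascending)."""
    _, eigvecs = np.linalg.eigh(matrix)
    return eigvecs[:, -1]

rng = np.random.default_rng(3)
latent = rng.normal(size=500)
# Three hypothetical variables sharing one latent factor, all with similar variance
X = np.column_stack([latent + rng.normal(scale=0.5, size=500) for _ in range(3)])

v_cov = pc1(np.cov(X, rowvar=False))           # PCA without normalization
v_corr = pc1(np.corrcoef(X, rowvar=False))     # PCA with normalization
agreement = abs(v_cov @ v_corr)                # 1.0 means identical directions

print("sample variances:", X.var(axis=0, ddof=1).round(3))
print(f"|cos| of angle between the two PC 1 directions: {agreement:.4f}")
```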

Zenon Gniazdowski

Maybe this will help: https://www.researchgate.net/publication/319469038_New_Interpretation_of_Principal_Components_Analysis

Daniel Forster

For PCA preparation, when the data is already on a Log scale, such as pH data, wouldn't Log-transforming it be pointless? Or would it be carried out to make it match the other variables?
