I would like to know in which cases data need to be normalized before PCA and cluster analysis. In the literature I have seen some PCA analyses done with prior log-transformation and standardization, and others done on the raw data.
One of the assumptions of PCA is linearity (the components are assumed to be linear combinations of the variables). In this context, the variables, which are random variables, should have a Gaussian distribution. If they do not, the data (the values of the random variables) should be normalized using a normalizing transformation. We usually use the Johnson transformation (the Johnson translation system) for normalizing the data. Unlike other transformations (for example, the Box-Cox transformation), with the Johnson translation system we can determine the suitable family (function) of the system from the kurtosis and squared skewness of the data distribution before normalizing.
I guess you are referring to standardization (normalization and standardization are often used interchangeably, but they can have different meanings depending on the context). As in many other multivariate procedures, one needs to ensure that no extra weight is given to the "larger" variables relative to the "smaller" ones, which would otherwise lead to biased outcomes. By large and small variables I'm referring to the "numbers" you have in your dataset: if Var_A is 1.25, 1.38, 1.05, … it will be viewed as small when compared to Var_B: 1250, 1380, 1050, …, which shows larger values (the average, variance, etc. will be larger for Var_B than for Var_A). If we rescale all the variables so that they have equal (or similar) averages and ranges (or variances), we avoid the pitfall of simultaneously using variables with different magnitudes. A common way to standardize is to center and rescale the original values so they lie in the range (-1, 1): y_standardized = (y_raw - y_min) / (y_max - y_min) * 2 - 1.
Another widely used procedure is to center and rescale the original data by subtracting the mean and dividing by the standard deviation (often called "autoscaling"): y_standardized = (y_raw - y_average) / y_stdev.
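A minimal numpy sketch of the two rescalings above (the data values are made up for illustration, echoing the Var_A/Var_B example):

```python
import numpy as np

# Hypothetical variables of very different magnitude.
var_a = np.array([1.25, 1.38, 1.05])
var_b = np.array([1250.0, 1380.0, 1050.0])

def minmax_scale(y):
    """Rescale to the range (-1, 1): (y - y_min) / (y_max - y_min) * 2 - 1."""
    return (y - y.min()) / (y.max() - y.min()) * 2 - 1

def autoscale(y):
    """Subtract the mean and divide by the standard deviation (z-scores)."""
    return (y - y.mean()) / y.std(ddof=1)

# After either rescaling, the two variables are directly comparable:
# minmax_scale maps both onto the same (-1, 1) range, and autoscale
# gives both zero mean and unit standard deviation.
print(minmax_scale(var_a))
print(minmax_scale(var_b))
print(autoscale(var_b))
```

Because var_b is just var_a times 1000 here, both variables map to identical values after either transformation, which is exactly the point of standardizing.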
You will certainly find more information on wikis and other statistics/chemometrics resources on the web.
Thank you very much, Sergiy and Luis, for your answers. In fact I am analysing some hydrochemistry data, and in the literature I found what I mentioned: some authors applied both standardization and normalization, while others ran the analysis on the raw data (without any transformation), but they did not explain the choice well. Your explanations have made it clear to me. Thank you very much.
Glad it helped and that the issue is now clear. If you found my (and others') answers valuable, I kindly ask you to "Vote as an interesting answer" so we also get ranked by our contributions.
For PCA, you may choose to center and/or scale your variables/columns. (It is also possible to center or scale the observations/rows, but this is uncommon.) These decisions depend on what properties of your data you'd like to keep intact.
To center a column is to subtract the column mean from each value, which sets the column mean to 0. If a variable has units where zero is meaningful to you, you may not want to center. For example, in a PET scan, the units are an amount of radioactivity, which is a measure of metabolic activity. In this example, zero (and the distance from zero) is meaningful. In addition, brain region A may have a very high mean, whereas brain region B may have a low mean. If you see this difference in means as noise, centering will remove it, and then each column will have the same mean, 0. If you see this difference in means as signal, then centering would destroy this important part of your data.
To normalize is to divide each column by some value so that each column has the same importance. There are many ways to normalize (or scale or standardize; for me these are synonyms) a column. Dividing a column by its standard deviation (if you've already centered) transforms the column into Z-scores. Scaling columns to have a sum of squares of 1 also has nice properties. Just as centering was a question of whether a difference in column means is meaningful to you, normalizing is a question of whether differences in the magnitude (importance) of columns are meaningful to you. A simple rule relates to the units. If your variables have different units, say dollars and yen, you probably want to normalize in order to give them equal importance, or else yen would contribute far more variance and would dominate the results. Normalizing gives columns equal variability, and therefore equal importance. If your variables have the same units, you may not want to normalize, because then differences in variability might be meaningful.
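A short numpy sketch of column-wise centering and the two normalizations mentioned above (the matrix is invented for illustration):

```python
import numpy as np

# Hypothetical matrix: 5 observations x 2 variables with very different means.
X = np.array([[10.0, 1000.0],
              [12.0, 1100.0],
              [11.0,  900.0],
              [13.0, 1200.0],
              [14.0,  800.0]])

# Centering: subtract each column mean, so every column mean becomes 0.
Xc = X - X.mean(axis=0)

# Z-scores: divide each centered column by its standard deviation.
Xz = Xc / Xc.std(axis=0, ddof=1)

# Alternative: scale each centered column to a sum of squares of 1.
Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))

print(Xc.mean(axis=0))          # each column mean is now 0
print(Xz.std(axis=0, ddof=1))   # each column now has unit variance
print((Xs ** 2).sum(axis=0))    # each column now has sum of squares 1
```

Either normalization puts the two columns on equal footing, regardless of how different their original magnitudes were.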
Whether (or how) to normalize can be tricky when you suspect that variables may have different amounts of variability because some variables are noisy. For example, in fMRI, high variability (large magnitude) in a brain region could be due to signal (important changes in activity over time), or could be due to noise (proximity to a large vein). It is tricky to decide how to proceed when different variables carry different amounts of quality information, because scaling will give equal importance to a variable that is a small source of noise and to another variable that is a large source of signal. This would boost noise and shrink signal, and would give untrustworthy results.
In general, in PCA, columns are centered and normalized, but it's important to think about these choices rather than just accepting the defaults.
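To see why the default matters, here is a hedged numpy sketch (simulated data, PCA computed via SVD rather than any particular library) comparing the first principal component with and without autoscaling:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: variable 0 in "small" units (std 1),
# variable 1 in "large" units (std 100), uncorrelated.
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 100, 100)])

def pca_first_loading(X, scale):
    """Return absolute loadings of the first PC, via SVD of the
    centered (and optionally autoscaled) data matrix."""
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return np.abs(vt[0])

# Without scaling, the large-variance column dominates the first PC;
# with scaling, both columns contribute comparably.
print(pca_first_loading(X, scale=False))
print(pca_first_loading(X, scale=True))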