I would like to know in which cases data need to be normalized before PCA and cluster analysis. In the literature I have seen some PCA analyses done with prior log-transformation and standardization, and others done on the raw data.
One of the assumptions of PCA is linearity (the components are assumed to be linear combinations of the variables). In this context, the variables, which are random variables, should have a Gaussian distribution. If they do not, the data (the values of the random variables) should be normalized using a normalizing transformation. We usually use the Johnson transformation (the Johnson translation system) for normalizing the data. Unlike other transformations (for example, the Box-Cox transformation), with the Johnson translation system we can determine the suitable family (function) of the system from the kurtosis and squared skewness of the data distribution before normalizing.
I guess you are referring to standardization (normalization and standardization are often used interchangeably, but they can have different meanings depending on the context). As in many other multivariate procedures, one needs to ensure that no extra weight is given to the "larger" variables relative to the "smaller" ones, which would otherwise lead to biased outcomes. By large and small variables I'm referring to the "numbers" you have in your dataset: if Var_A is 1.25, 1.38, 1.05, … it will be viewed as small when compared to Var_B: 1250, 1380, 1050, …, which shows larger values (the average, variance, etc. will be larger for Var_B than for Var_A). If we rescale all the variables so that they have equal (or similar) averages and ranges (or variances), we avoid the pitfall of simultaneously using variables with different magnitudes. A common way to standardize is to center and rescale the original values so they lie in the range (-1, 1): y_standardized = (y_raw - y_min) / (y_max - y_min) * 2 - 1.
Another widely used procedure is to center and rescale the original data by subtracting the mean and dividing by the standard deviation (often called "autoscaling"): y_standardized = (y_raw - y_average) / y_stdev.
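A minimal numpy sketch of the two rescalings above (the data values are made up for illustration, echoing the Var_A/Var_B example):

```python
import numpy as np

# Hypothetical variables of very different magnitude.
var_a = np.array([1.25, 1.38, 1.05])
var_b = np.array([1250.0, 1380.0, 1050.0])

def minmax_scale(y):
    """Rescale to the range (-1, 1): (y - y_min) / (y_max - y_min) * 2 - 1."""
    return (y - y.min()) / (y.max() - y.min()) * 2 - 1

def autoscale(y):
    """Subtract the mean and divide by the standard deviation (z-scores)."""
    return (y - y.mean()) / y.std(ddof=1)

# After either rescaling, the two variables are directly comparable:
# minmax_scale maps both onto the same (-1, 1) range, and autoscale
# gives both zero mean and unit standard deviation.
print(minmax_scale(var_a))
print(minmax_scale(var_b))
print(autoscale(var_b))
```

Because var_b is just var_a times 1000 here, both variables map to identical values after either transformation, which is exactly the point of standardizing.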
You will certainly find more information on wikis and other statistics/chemometrics resources on the web.
Thank you very much, Sergiy and Luis, for your answers. In fact I am analysing some hydrochemistry data, and in the literature I found what I mentioned: some authors applied both standardization and normalization, while others ran the analysis on the raw data (without any transformation), but they did not explain the choice well. Your explanations have made it clear to me. Thank you very much.
Glad it helped and that the issue is now clear. If you found my (and others') answers valuable, I kindly ask you to "Vote as an interesting answer" so we also get ranked by our contributions.
For PCA, you may choose to center and/or scale your variables/columns. (It is also possible to center or scale the observations/rows, but this is uncommon.) These decisions depend on what properties of your data you'd like to keep intact.
To center a column is to subtract the column mean from each value, which sets the column mean to 0. If a variable has units where zero is meaningful to you, you may not want to center. For example, in a PET scan, the units are an amount of radioactivity, which is a measure of metabolic activity. In this example, zero (and the distance from zero) is meaningful. In addition, brain region A may have a very high mean, whereas brain region B may have a low mean. If you see this difference in means as noise, centering will remove it, and then each column will have the same mean, 0. If you see this difference in means as signal, then centering would destroy this important part of your data.
To normalize is to divide each column by some value so that each column has the same importance. There are many ways to normalize (or scale or standardize; for me these are synonyms) a column. Dividing a column by its standard deviation (if you've already centered) transforms the column into Z-scores. Scaling columns to have a sum of squares of 1 also has nice properties. Just as centering was a question of whether a difference in column means is meaningful to you, normalizing is a question of whether differences in the magnitude (importance) of columns are meaningful to you. A simple rule relates to the units. If your variables have different units, say dollars and yen, you probably want to normalize in order to give them equal importance, or else yen would contribute far more variance and would dominate the results. Normalizing gives columns equal variability, and therefore equal importance. If your variables have the same units, you may not want to normalize, because then differences in variability might be meaningful.
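A short numpy sketch of column-wise centering and the two normalizations mentioned above (the matrix is invented for illustration):

```python
import numpy as np

# Hypothetical matrix: 5 observations x 2 variables with very different means.
X = np.array([[10.0, 1000.0],
              [12.0, 1100.0],
              [11.0,  900.0],
              [13.0, 1200.0],
              [14.0,  800.0]])

# Centering: subtract each column mean, so every column mean becomes 0.
Xc = X - X.mean(axis=0)

# Z-scores: divide each centered column by its standard deviation.
Xz = Xc / Xc.std(axis=0, ddof=1)

# Alternative: scale each centered column to a sum of squares of 1.
Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))

print(Xc.mean(axis=0))          # each column mean is now 0
print(Xz.std(axis=0, ddof=1))   # each column now has unit variance
print((Xs ** 2).sum(axis=0))    # each column now has sum of squares 1
```

Either normalization puts the two columns on equal footing, regardless of how different their original magnitudes were.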
Whether (or how) to normalize can be tricky when you suspect that variables may have different amounts of variability because some variables are noisy. For example, in fMRI, high variability (large magnitude) in a brain region could be due to signal (important changes in activity over time), or could be due to noise (proximity to a large vein). It is tricky to decide how to proceed when different variables carry different amounts of quality information, because scaling will give equal importance to a variable that is a small source of noise and to another variable that is a large source of signal. This would boost noise and shrink signal, and would give untrustworthy results.
In general, in PCA, columns are centered and normalized, but it's important to think about these choices rather than just accepting the defaults.
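To see why the default matters, here is a hedged numpy sketch (simulated data, PCA computed via SVD rather than any particular library) comparing the first principal component with and without autoscaling:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: variable 0 in "small" units (std 1),
# variable 1 in "large" units (std 100), uncorrelated.
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 100, 100)])

def pca_first_loading(X, scale):
    """Return absolute loadings of the first PC, via SVD of the
    centered (and optionally autoscaled) data matrix."""
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return np.abs(vt[0])

# Without scaling, the large-variance column dominates the first PC;
# with scaling, both columns contribute comparably.
print(pca_first_loading(X, scale=False))
print(pca_first_loading(X, scale=True))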