How do we ensure data is normally distributed before performing PCA and HCA using SPSS or R?

André I Wierdsma Popular answer

If your dependent variable is not normally distributed, you have four options: (1) forget about it, since the Central Limit Theorem will help you out; (2) transform the variable or exclude outliers; (3) go non-parametric; or (4) use another model (gamma, Poisson, negative binomial, etc.). A non-normal continuous predictor is less of a problem: regression models only assume a linear relation between the predictor and the dependent variable. So don't kick such predictors out, but look at the relationships graphically.
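Options (3) and (4) can be sketched in R roughly as follows. The data are simulated purely for illustration, and the variable names (`x`, `y`, `fit_gamma`) are hypothetical; a gamma GLM with a log link is only one of the alternative models mentioned, chosen here because the simulated response is positive and right-skewed.

```r
# Simulated right-skewed response (hypothetical data, not from the thread)
set.seed(42)
n  <- 200
x  <- runif(n)
mu <- exp(1 + 2 * x)                       # true mean grows with x
y  <- rgamma(n, shape = 2, rate = 2 / mu)  # gamma response with mean mu

# Option (4): model the skewed response directly with a gamma GLM
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"))

# Option (3): a rank-based look at the relationship, no distributional assumption
rho <- cor(x, y, method = "spearman")

print(coef(fit_gamma))  # estimates on the log scale; the simulated values are 1 and 2
print(rho)
```

The GLM leaves the response on its original scale and changes the error model instead, which avoids having to back-transform predictions.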

Richard James Telford

Standardising data is not the same as normalising it (although some people use the terms interchangeably).

Depending on the options you use, PCA will automatically standardise the variables for you.

For example

princomp(MyData, cor=TRUE)

will use the correlation matrix - equivalent to standardising the data to mean zero and standard deviation 1.
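A quick way to convince yourself of this equivalence is to compare `princomp(..., cor = TRUE)` with `prcomp(..., scale. = TRUE)`, which standardises the variables explicitly. The sketch below assumes the built-in `USArrests` data as a stand-in for `MyData`; the object names are illustrative only.

```r
# USArrests ships with base R; it stands in for MyData here
MyData <- USArrests

pca_cor    <- princomp(MyData, cor = TRUE)   # PCA on the correlation matrix
pca_scaled <- prcomp(MyData, scale. = TRUE)  # PCA on explicitly standardised data

# Component standard deviations should agree between the two approaches
print(unname(pca_cor$sdev))
print(pca_scaled$sdev)
same_sdev <- isTRUE(all.equal(unname(pca_cor$sdev), pca_scaled$sdev))
print(same_sdev)
```

The signs of individual loadings may differ between the two functions, since the direction of each component is arbitrary.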

If the data are not normally distributed, it may be necessary to transform them. Plot a histogram (or, better, a qqnorm plot) of each variable and check for marked skew or kurtosis. Log or square-root transforms are common, but there is a whole range of transformations that can be used.
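As a minimal sketch of that workflow, the code below simulates one right-skewed variable (the name `conc` and the log-normal data are assumptions for illustration), compares its skewness before and after log and square-root transforms, and draws normal Q-Q plots when run interactively. Base R has no skewness function, so a simple moment-based helper is defined here.

```r
set.seed(1)
conc <- rlnorm(200, meanlog = 0, sdlog = 1)  # hypothetical right-skewed variable

# Moment-based sample skewness (helper defined here, not a base R function)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw  <- skewness(conc)
skew_log  <- skewness(log(conc))
skew_sqrt <- skewness(sqrt(conc))
print(c(raw = skew_raw, log = skew_log, sqrt = skew_sqrt))

# Visual check: points close to the reference line suggest approximate normality
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  qqnorm(conc, main = "Raw"); qqline(conc)
  qqnorm(log(conc), main = "Log-transformed"); qqline(log(conc))
  par(op)
}
```

Whichever transform brings the skewness closest to zero and the Q-Q plot closest to a straight line is usually the one to prefer for that variable.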

Philip M Nyenje

Hello Richard, thank you very much for your answer. I have realised that most of my variables are skewed to the right. Some of them need a log transformation and some need a square-root transformation. Can I use different transformations within the same dataset, or do we have to use the same transformation throughout and then ignore those variables that cannot be transformed?

I will try to read more about princomp.

Adrian Otoiu

It is usually a bad idea to transform data before doing any estimation or modelling, as the data lose their properties and the modelling is done on something that may not reflect the underlying phenomena. This comes from my professor James MacKinnon, and his advice is usually very good. Are you sure that data have to be normalized for PCA/HCA? I have seen several analyses where this was not done. Maybe you need to consult some classic textbooks on this.

Dear Vladimir: Thank you for this answer. Yes, I have read several papers and they suggest that data have to be normalised. I have seen that the two criteria for testing normality are also included in SPSS: Kolmogorov-Smirnov for n>50 and the other for n

In real life this doesn't really happen. I have not seen many studies where the data were first normalized, at least in regression analysis.

Vladimir Bakhrushin

First of all, you have to check the normality of the data. This can be done using criteria such as omega-squared or Kolmogorov-Smirnov. You must also verify the homogeneity of the existing samples. If they are heterogeneous, normalization has no meaning.
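In R, a formal check of this kind might look like the sketch below. It uses simulated data (the names `skewed` and `logged` are hypothetical) and base R's `shapiro.test` and `ks.test`; the omega-squared (Cramér-von Mises) criterion is not in base R and is left out here.

```r
set.seed(7)
skewed <- rlnorm(200)  # hypothetical right-skewed sample
logged <- log(skewed)  # its log transform

# Shapiro-Wilk test (base R, sample sizes 3 to 5000)
sw_skewed <- shapiro.test(skewed)
sw_logged <- shapiro.test(logged)

# Kolmogorov-Smirnov against a normal with estimated mean and sd.
# Estimating the parameters from the same data makes this p-value only approximate.
ks_skewed <- ks.test(skewed, "pnorm", mean = mean(skewed), sd = sd(skewed))

print(sw_skewed$p.value)
print(sw_logged$p.value)
print(ks_skewed$p.value)
```

A small p-value is evidence against normality; a large one does not prove the data are normal, especially in small samples.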

André I Wierdsma

Formal normality tests and graphical methods will be of limited use (see the link).

You could enter a non-parametric correlation matrix in your factor-analyses.

http://www.statisticalmisses.nl/index.php/frequently-asked-questions/77-what-is-wrong-with-tests-of-normality

Here is SPSS-syntax for scale-free nonparametric factor analysis
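For R users, the idea of entering a rank-based correlation matrix into the analysis could be sketched as below. This is not the SPSS syntax mentioned above; it is a rough R analogue that assumes the built-in `USArrests` data as a stand-in and uses a plain eigen-decomposition of the Spearman matrix, which gives principal components rather than a full factor-analysis model.

```r
MyData <- USArrests  # built-in example data, standing in for the real variables

# Rank-based (Spearman) correlation matrix: unchanged by monotone increasing
# transforms of the individual variables
R_s <- cor(MyData, method = "spearman")

# Principal components from the eigen-decomposition of that matrix
eig      <- eigen(R_s)
prop_var <- eig$values / sum(eig$values)  # share of variance per component

print(round(prop_var, 3))
print(round(eig$vectors, 3))              # eigenvectors (columns = components)
```

Because ranks are used, log or square-root transforming a variable first would leave this matrix, and hence the components, unchanged.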

Andre: This is a very interesting contribution. From the link you sent, the non-parametric methods do not require normality tests. I will try to read more about the non-parametric analysis syntax you sent.

Generally, I think that data are rarely normally distributed. Some can be transformed and some can't. However, if one kicks out certain variables on the basis that they violate the requirements for normality, there is a possibility of missing important processes that those variables could explain. This could be an important limitation of multivariate statistics in understanding hydrological processes.