As others have noted, people often transform in hopes of achieving normality prior to using some form of the general linear model (e.g., t-test, ANOVA, regression, etc). But I fear that in many cases, people make two mistakes when doing so:
1. They look at normality of the outcome variable rather than normality of the errors. For OLS models, it is the errors that are assumed to be independently and identically distributed as normal with mean = 0. (Some people also assume that explanatory variables in regression models must be normally distributed. But that is clearly incorrect. For example, an OLS linear regression model with one dichotomous explanatory variable is equivalent to an unpaired t-test, which is a perfectly good model.) A small sketch of checking the residuals rather than the outcome appears just after this list.
2. They overestimate the importance of the normality assumption. Or putting it another way, they underestimate the robustness of OLS models to non-normality of the errors. (And in reality, they are never truly normal anyway. As George Box noted, normal distributions and straight lines don't exist in nature; but they are still useful approximations to the statistician.)
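To make point 1 concrete, here is a minimal Python sketch (assuming statsmodels and scipy are available; the data and variable names are made up purely for illustration): fit the OLS model first, then assess normality on the residuals rather than on the outcome itself.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Made-up data: y is a mixture of two groups (not normal overall),
# but the errors around each group mean are normal.
rng = np.random.default_rng(42)
x = rng.integers(0, 2, size=100)            # dichotomous explanatory variable
y = 3 + 2 * x + rng.normal(0, 1, size=100)  # outcome

# Fit the OLS model first ...
model = sm.OLS(y, sm.add_constant(x)).fit()

# ... then check normality of the RESIDUALS, not of y itself
resid = model.resid
print(stats.shapiro(resid))       # Shapiro-Wilk test on the residuals
sm.qqplot(resid, line="45")       # QQ plot of the residuals (requires matplotlib)
```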
In the context of OLS models, transformations are more often about stabilizing the variance, it seems to me--e.g., the log transform when the SD is proportional to the mean. But in some contexts, one may transform to obtain a test statistic that has an approximately normal sampling distribution--e.g., the sampling distribution of the odds ratio (OR) is not normal, but the sampling distribution of ln(OR) is asymptotically normal with SE = SQRT(1/a + 1/b + 1/c +1/d) where a-d are the 4 cell counts in the 2x2 table.
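As an illustration of that last point, here is a small Python sketch computing the OR, ln(OR), its standard error, and a 95% confidence interval from a hypothetical 2x2 table; the cell counts a, b, c, d are made up.

```python
import math

# Hypothetical 2x2 table cell counts
a, b, c, d = 20, 80, 10, 90

or_hat = (a * d) / (b * c)                        # odds ratio
log_or = math.log(or_hat)                         # ln(OR), asymptotically normal
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)     # SE of ln(OR)

# 95% CI on the log scale, then exponentiate back to the OR scale
lo, hi = math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)
print(f"OR = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```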
Here is a nice note on transformation of data that you may find helpful.
Mainly, when data do not meet the normality assumption, we transform them to achieve normality. There are multiple types of transformations, such as square root, cube, inverse, etc. You can use the 'Compute' option in SPSS to create the transformed variables.
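For those not using SPSS, the same transformed variables can be created in a line or two of Python; this is only a sketch with a made-up positive variable x.

```python
import numpy as np

x = np.array([1.2, 3.5, 0.8, 10.4, 2.1])   # made-up positive data

x_sqrt = np.sqrt(x)        # square-root transform
x_cube = x ** 3            # cubic transform
x_inv  = 1.0 / x           # inverse (reciprocal) transform
x_log  = np.log(x)         # log transform (requires x > 0)
print(x_log)
```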
Hi Rihab, Bruce's answer is really outstanding. The purpose of transformation in most instances is not merely to take a variable that is non-normal and bring it to normality; it is to try to meet the assumptions of a statistical test or procedure (which you would review when using such a procedure), and those assumptions in one way or another have to do with the errors (e.g., residuals). For the most part, when the assumptions aren't met the standard errors are biased, and because the standard errors are generally used in getting to the p-value, we might reach a faulty conclusion regarding the null hypothesis. So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust. In helping to choose how to transform a variable, you might find the term "Tukey's ladder" to be a useful search term, as the great mathematician John Tukey created an ordered list of transformations to help bring skewed distributions toward normality. But again, in simple cases it might make sense to use a test that, say, converts the raw values to ranks (as many nonparametric tests do) and sidesteps some of the problems that a skewed distribution may be causing for a parametric test; if you need something more complex, such as multiple regression, a Tukey-style transformation may help you meet the requirements for the residuals that you cannot meet with the original, untransformed variable. Bob
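To make Bob's pointer to Tukey's ladder concrete, here is a rough Python sketch (not an official implementation) that walks the ladder of powers on positive data and picks the rung whose result has skewness closest to zero, using skewness only as a crude symmetry criterion.

```python
import numpy as np
from scipy import stats

def tukey_ladder(x):
    """Try the classic ladder of powers on positive data and return the
    transform whose result has skewness closest to zero (a crude criterion)."""
    ladder = {
        "1/x^2":     lambda v: 1 / v**2,
        "1/x":       lambda v: 1 / v,
        "1/sqrt(x)": lambda v: 1 / np.sqrt(v),
        "log(x)":    np.log,
        "sqrt(x)":   np.sqrt,
        "x":         lambda v: v,
        "x^2":       lambda v: v**2,
    }
    results = {name: f(x) for name, f in ladder.items()}
    best = min(results, key=lambda name: abs(stats.skew(results[name])))
    return best, results[best]

# Example with made-up right-skewed data
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=200)
name, transformed = tukey_ladder(x)
print("Chosen rung of the ladder:", name)
```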
Hi, I think Bob has hit on the key point: transformations are used to meet the assumptions. But I miss a reference to the sample space, which should be the first thing to check. Normality assumes that the sample space of the random variable under study is the whole real line, which is hardly ever the case.
"So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust."
An example of what Bob says here would be using the Welch-Satterthwaite (unequal variances) t-test when one has heterogeneity of variance (especially if it is in combination with very discrepant sample sizes). SPSS also includes unequal variances versions of one-way ANOVA in its ONEWAY procedure. But what is not so well known is that nowadays, one can use procedures for performing multilevel modeling (e.g., the MIXED procedure in SPSS) to allow for heterogeneous error variances in more complex ANOVA or ANCOVA-like models. IMO, this provides a very attractive alternative to transformation in many cases.
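In Python, the Welch (unequal variances) t-test that Bruce mentions is just a flag away from the ordinary t-test; a minimal sketch with made-up groups of discrepant size and variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(loc=10, scale=1, size=50)   # small variance, larger n
group2 = rng.normal(loc=11, scale=4, size=15)   # large variance, smaller n

# Student's t-test assumes equal variances; Welch's does not
print(stats.ttest_ind(group1, group2, equal_var=True))   # classic t-test
print(stats.ttest_ind(group1, group2, equal_var=False))  # Welch-Satterthwaite
```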
Also, in regression analysis the transformation is sometimes important: linear least squares regression assumes that the relationship between the variables is linear. Often we can "straighten" a nonlinear relationship by transforming one or more of the variables. This URL will be helpful for you.
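As a sketch of this "straightening", suppose y grows roughly exponentially in x; regressing log(y) on x recovers an approximately linear relationship. The data are made up and statsmodels is assumed to be available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100)
y = np.exp(0.5 * x) * rng.lognormal(0, 0.1, size=100)   # curved, multiplicative noise

X = sm.add_constant(x)
fit_raw = sm.OLS(y, X).fit()          # linear fit to the curved relationship
fit_log = sm.OLS(np.log(y), X).fit()  # linear fit after log-transforming y

# (R^2 values are on different response scales; this is only a rough illustration
#  that the log-scale relationship is approximately linear)
print("R^2 untransformed:", round(fit_raw.rsquared, 3))
print("R^2 after log(y): ", round(fit_log.rsquared, 3))
```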
1.1. Left tail (left skewed): square or cube the variable. This raises the variable values to a power greater than 1.
1.2. Right tail (right skewed): take the square root, log, or reciprocal of the variable values.
2.0 To run that in SPSS: Transform > Compute variable... > follow the on-screen instructions to transform the variable.
3.0 If these transformations fail to achieve normality, opt for the Box-Cox transformation, which uses a lambda value (see the sketch after this list). It's not a straightforward data transformation, but that should be your last resort. Find it here: http://pareonline.net/pdf/v15n12.pdf
4.0 If everything else fails, consider non-parametric tests.
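A minimal Python sketch of the Box-Cox transformation mentioned in point 3.0, using scipy (which estimates lambda by maximum likelihood); the data are made up and must be strictly positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=1, sigma=0.8, size=300)   # made-up, right-skewed, positive

x_bc, lam = stats.boxcox(x)   # transformed data and the estimated lambda
print("Estimated lambda:", round(lam, 2))
print("Skewness before:", round(stats.skew(x), 2),
      "after:", round(stats.skew(x_bc), 2))
```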
When some variables are nearly normally distributed and others show much wider spreads in their values, it is sometimes necessary to take their log to bring them closer to normality. If the differences in values are not so wide, one can take square roots or cube roots of the data. However, sometimes the transformation is dictated by the functional form, as in the Cobb–Douglas production function.
Rihab,
Apart from the good suggestions by Bruce and others, a transformation of the data is sometimes required when, on the basis of several variables, one wants to calculate a cumulative index to represent some construct or concept. For that, the data should be additive. Since the data are generally in different metrics (standard measurement units), they cannot simply be added. Therefore, to make them additive, they are transformed so that they become scale-free. There are a number of methods for making data scale-free; most commonly the z-score method is used, which transforms variables so that their means are zero and their variances are one.
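The z-score transformation described above is a one-liner; a small sketch with two made-up variables measured on different scales:

```python
import numpy as np

income = np.array([25_000, 40_000, 60_000, 120_000])    # in dollars
years_education = np.array([10, 12, 16, 20])             # in years

def zscore(v):
    """Center to mean 0 and scale to unit variance (sample SD)."""
    return (v - v.mean()) / v.std(ddof=1)

# Both variables are now scale-free and can be averaged into a crude composite index
composite = (zscore(income) + zscore(years_education)) / 2
print(composite)
```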
I agree with Bruce's answer when transformations are intended to approximate a normal distribution of the variable or residuals in a linear model. However, I think that there are deep reasons for transforming data in many circumstances not related to normality of the variables. Most statistical methods and models are designed for real data: this means that their values may be positive or negative (the whole real line), that the linear operation between variables is the sum, that scaling is multiplying by positive constants, and that differences are computed by ordinary subtraction, i.e. the data are assumed to have an absolute scale. This is the structure of the sample space. The main reason for transforming a random variable and/or the sample values is to make the transformed values compatible with the implicit assumptions of the statistical analysis of real data and its sample space.
One of the simplest and most frequent cases is that of positive variables on a ratio scale: the zero value is not attainable, and differences are measured by ratios, e.g. 2 is double 1 (2/1 = 2), but 1000 and 1001 are considered almost equal (1001/1000 ≈ 1), although the ordinary (Euclidean) difference is equal to 1 in both cases. In such cases, taking logs of the data can be seen as a change from the ratio scale to the absolute scale. And this is done independently of whether the resulting values appear to be normally distributed.
As a last comment, when the sample space of a variable is limited (positive, or an interval), it is theoretically impossible for that variable to be normally distributed, even when the normal distribution may be a good approximation of the true distribution. Transformations like the log for positive data, or the logit for proportion data, may not make the distribution exactly normal, but at least they make normality possible.
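A small sketch of the two transformations mentioned here: the log maps positive data onto the whole real line (and turns equal ratios into equal differences), and the logit does the same for proportions in (0, 1). The numbers are made up.

```python
import numpy as np
from scipy.special import logit

positive_data = np.array([0.5, 1.0, 2.0, 1000.0, 1001.0])
proportions   = np.array([0.02, 0.10, 0.50, 0.90, 0.98])

# log: ratio scale -> absolute scale; equal ratios become equal differences
print(np.log(positive_data))
print(np.log(2.0) - np.log(1.0), "vs", np.log(1001.0) - np.log(1000.0))

# logit: (0, 1) -> whole real line, so normality is at least possible
print(logit(proportions))
```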
I would like to insist on the final comment of Egozcue: when the variable has a constrained support, i.e. it is restricted to be positive or has to lie in an interval, then it is impossible for it to follow a normal distribution, because the support of the normal distribution is the whole real line, going from minus infinity to plus infinity.
It goes even further than what Vera says, if you believe what George Box said in section 2.5 (Role of Mathematics in Science) in his classic article, Science and Statistics (see link below). Here's the excerpt I have in mind (with emphasis added).
In applying mathematics to subjects such as physics or statistics we make tentative assumptions about the real world which we know are false but which we believe may be useful nonetheless. The physicist knows that particles have mass and yet certain results, approximating what really happens, may be derived from the assumption that they do not. Equally, the statistician knows, for example, that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.
That is a good reference, Bruce! Nevertheless, even when accepting an only approximate solution, it is good that at least the other assumptions hold. I mean the assumptions involved in taking the real line as the sample space, and in particular the absolute scale when dealing with real random variables. Transformations like the log help to turn, e.g., a ratio scale into an absolute scale.
"So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust."
If, for example, the p-value of my model is not significant (one of the most important assumptions of linear regression), can I use Bob's argument to justify the use of a log transformation?
Quick note on application: I think the power transformation (i.e., Box-Cox) should be mentioned. There are plenty of readily available sources on this.
I think your original question is an important one, and there are many good answers above. I'm afraid I can't contribute much more useful information as to why the analyst would transform data (assuming it's the response variable that's being transformed), but I will offer a glance at some alternative procedures for when the analyst suspects the assumptions of a "classic" (OLS-GLM) test have been unreasonably violated. Some have already alluded to alternatives to transformation (see below), so I hope this is still within the scope of the original question/information sought.
e.g., Robert's comment which was highlighted by Bruce (my emphasis), "So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust."
e.g., Another comment from Robert, "... it might make sense to use a test that say converts the raw values to ranks (as many nonparametric tests do) and sidesteps some of the problems that a skewed distribution may be causing with some parametric test..."
I want to first reiterate some of Robert's comments: transformations are typically used to satisfy an assumption (or assumptions) of a statistical test—assuming we've been referring to classical tests based on ordinary least squares (e.g., ANOVA). As Bruce mentioned, we should be making these assumptions about the residuals of a fitted model. Namely, observations should be independent (i.e., no autocorrelation or pseudoreplication), homoskedastic (i.e., [near] equal variance), and normally (Gaussian) distributed. Transformations really can't help if your data aren't independent; that depends more on the design of the experiment and the sampling scheme. What they can do, as Bruce and Robert mention, is help meet the assumptions of homoskedasticity and normality. One thing to keep in mind when transforming is that interpretation, graphing, and reporting of results may not be as straightforward.
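As a companion to the residual-normality sketch earlier in the thread, here is a rough Python sketch of checking the other two assumptions named above — independence and homoskedasticity — on the residuals of a fitted OLS model (statsmodels assumed; the data are made up and deliberately heteroskedastic):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1 + 0.5 * x + rng.normal(0, 0.2 * x)   # error SD grows with x (heteroskedastic)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Homoskedasticity: Breusch-Pagan test on the residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", round(lm_pvalue, 4))

# Independence (serial correlation): Durbin-Watson statistic (values near 2 suggest none)
print("Durbin-Watson:", round(durbin_watson(fit.resid), 2))
```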
If you, the analyst, fear that assumptions have been unreasonably violated and you choose not to transform (or transforming will not improve the analysis), you might consider the following (not an exhaustive list, just what I am aware of; a rough sketch of one of these alternatives appears after the list):
Non-independence:
- Autocovariate
- Repeated-measures analysis
- Mixed-model framework
- Permutation (sometimes)

Heteroskedasticity:
- Corrections to denominator degrees of freedom (e.g., Welch's, Satterthwaite, Kenward-Roger)
- Generalized linear models (if you can approximate a known/common distribution, e.g., lognormal, Poisson)

Non-normally distributed:
- Generalized linear models (same as above)
- Permutation, bootstrapping
- Nonparametric approaches
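To give a flavour of the alternatives above, here is a rough Python sketch of a permutation test for a difference in means, which makes no distributional assumption about the errors (a Poisson or other GLM via statsmodels' GLM would follow a similar fit-and-check pattern); the data are made up:

```python
import numpy as np

rng = np.random.default_rng(5)
group_a = rng.exponential(scale=2.0, size=30)   # skewed, made-up data
group_b = rng.exponential(scale=3.0, size=30)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

# Permutation test: shuffle group labels and rebuild the null distribution
n_perm = 10_000
null = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)
    null[i] = perm[30:].mean() - perm[:30].mean()

p_value = np.mean(np.abs(null) >= abs(observed))
print("Observed difference:", round(observed, 2), " permutation p =", round(p_value, 4))
```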
You'll notice that some of the alternatives fall into multiple categories. Really, many of them could, depending on which assumption(s) are violated and the degree of the violation. You should realize, too, that these all come with different or additional assumptions and sometimes different philosophies. You'll also need to gain the technical know-how to implement and evaluate any of the above methods. It can be worth it, though.
I encourage others to add, improve, or critique the above, because I know it's a lot (each of those bullet points has enough literature to keep the analyst busy for some time!) and I may have neglected or poorly explained something.
At the risk of sounding redundant and ambiguous, know your data. Exploratory data analysis does not get enough attention. I think Zuur et al. 2010 is a good resource to have a look at. http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2009.00001.x/abstract
Cheers,
Caleb
2018-10-03 edit:
David DesRochers' comment (below) made me think to add a paper (likely of interest to thread readers).
Article What's normal anyway? Residual plots are more telling than s...
While this conversation is an older one, I am finding all of these contributions amazingly helpful. I am creating an introduction-to-data lecture for my research methods course, and I find this conversation thread to be an amazing resource. I find the theory behind the offerings very helpful as well. Thank you greatly!