As others have noted, people often transform in hopes of achieving normality prior to using some form of the general linear model (e.g., t-test, ANOVA, regression, etc). But I fear that in many cases, people make two mistakes when doing so:
1. They look at normality of the outcome variable rather than normality of the errors. For OLS models, it is the errors that are assumed to be independently and identically distributed as normal with mean = 0. (Some people also assume that explanatory variables in regression models must be normally distributed. But that is clearly incorrect. For example, an OLS linear regression model with one dichotomous explanatory variable is equivalent to an unpaired t-test, which is a perfectly good model.) A small sketch of checking the residuals rather than the outcome appears just after this list.
2. They overestimate the importance of the normality assumption. Or putting it another way, they underestimate the robustness of OLS models to non-normality of the errors. (And in reality, they are never truly normal anyway. As George Box noted, normal distributions and straight lines don't exist in nature; but they are still useful approximations to the statistician.)
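To make point 1 concrete, here is a minimal Python sketch (assuming statsmodels and scipy are available; the data and variable names are made up purely for illustration): fit the OLS model first, then assess normality on the residuals rather than on the outcome itself.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Made-up data: y is a mixture of two groups (not normal overall),
# but the errors around each group mean are normal.
rng = np.random.default_rng(42)
x = rng.integers(0, 2, size=100)            # dichotomous explanatory variable
y = 3 + 2 * x + rng.normal(0, 1, size=100)  # outcome

# Fit the OLS model first ...
model = sm.OLS(y, sm.add_constant(x)).fit()

# ... then check normality of the RESIDUALS, not of y itself
resid = model.resid
print(stats.shapiro(resid))       # Shapiro-Wilk test on the residuals
sm.qqplot(resid, line="45")       # QQ plot of the residuals (requires matplotlib)
```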
In the context of OLS models, transformations are more often about stabilizing the variance, it seems to me--e.g., the log transform when the SD is proportional to the mean. But in some contexts, one may transform to obtain a test statistic that has an approximately normal sampling distribution--e.g., the sampling distribution of the odds ratio (OR) is not normal, but the sampling distribution of ln(OR) is asymptotically normal with SE = SQRT(1/a + 1/b + 1/c +1/d) where a-d are the 4 cell counts in the 2x2 table.
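As an illustration of that last point, here is a small Python sketch computing the OR, ln(OR), its standard error, and a 95% confidence interval from a hypothetical 2x2 table; the cell counts a, b, c, d are made up.

```python
import math

# Hypothetical 2x2 table cell counts
a, b, c, d = 20, 80, 10, 90

or_hat = (a * d) / (b * c)                        # odds ratio
log_or = math.log(or_hat)                         # ln(OR), asymptotically normal
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)     # SE of ln(OR)

# 95% CI on the log scale, then exponentiate back to the OR scale
lo, hi = math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)
print(f"OR = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```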
Here is a nice note on transformation of data that you may find helpful.
Mainly, when data do not meet the normality assumption, we transform them to achieve normality. There are multiple types of transformations, such as square root, cube, inverse, etc. You can use the 'Compute' option in SPSS to create the transformed variables.
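For those not using SPSS, the same transformed variables can be created in a line or two of Python; this is only a sketch with a made-up positive variable x.

```python
import numpy as np

x = np.array([1.2, 3.5, 0.8, 10.4, 2.1])   # made-up positive data

x_sqrt = np.sqrt(x)        # square-root transform
x_cube = x ** 3            # cubic transform
x_inv  = 1.0 / x           # inverse (reciprocal) transform
x_log  = np.log(x)         # log transform (requires x > 0)
print(x_log)
```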
Hi Rihab, Bruce's answer is really outstanding. The purpose of transformation in most instances is not merely to take a variable that is non-normal and bring it to normality; it is to try to meet the assumptions of a statistical test or procedure (which you would review when using such a procedure), and those assumptions in one way or another have to do with the errors (e.g., residuals). For the most part, when the assumptions aren't met the standard errors are biased, and because the standard errors are generally used in getting to the p-value, we might reach a faulty conclusion regarding the null hypothesis. So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust. In helping to choose how to transform a variable, you might find the term "Tukey's ladder" to be a useful search term, as the great mathematician John Tukey created an ordered list of transformations to help bring skewed distributions toward normality. But again, in simple cases it might make sense to use a test that, say, converts the raw values to ranks (as many nonparametric tests do) and sidesteps some of the problems that a skewed distribution may be causing for a parametric test; if you need something more complex, such as multiple regression, a Tukey-style transformation may help you meet the requirements for the residuals that you cannot meet with the original, untransformed variable. Bob
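To make Bob's pointer to Tukey's ladder concrete, here is a rough Python sketch (not an official implementation) that walks the ladder of powers on positive data and picks the rung whose result has skewness closest to zero, using skewness only as a crude symmetry criterion.

```python
import numpy as np
from scipy import stats

def tukey_ladder(x):
    """Try the classic ladder of powers on positive data and return the
    transform whose result has skewness closest to zero (a crude criterion)."""
    ladder = {
        "1/x^2":     lambda v: 1 / v**2,
        "1/x":       lambda v: 1 / v,
        "1/sqrt(x)": lambda v: 1 / np.sqrt(v),
        "log(x)":    np.log,
        "sqrt(x)":   np.sqrt,
        "x":         lambda v: v,
        "x^2":       lambda v: v**2,
    }
    results = {name: f(x) for name, f in ladder.items()}
    best = min(results, key=lambda name: abs(stats.skew(results[name])))
    return best, results[best]

# Example with made-up right-skewed data
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=200)
name, transformed = tukey_ladder(x)
print("Chosen rung of the ladder:", name)
```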
Hi, I think Bob has hit on the key point: transformations are used to meet the assumptions. But I miss a reference to the sample space, which should be the first thing to check. Normality assumes that the sample space of the random variable under study is the whole real line, which is hardly ever the case.
"So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust."
An example of what Bob says here would be using the Welch-Satterthwaite (unequal variances) t-test when one has heterogeneity of variance (especially if it is in combination with very discrepant sample sizes). SPSS also includes unequal variances versions of one-way ANOVA in its ONEWAY procedure. But what is not so well known is that nowadays, one can use procedures for performing multilevel modeling (e.g., the MIXED procedure in SPSS) to allow for heterogeneous error variances in more complex ANOVA or ANCOVA-like models. IMO, this provides a very attractive alternative to transformation in many cases.
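In Python, the Welch (unequal variances) t-test that Bruce mentions is just a flag away from the ordinary t-test; a minimal sketch with made-up groups of discrepant size and variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(loc=10, scale=1, size=50)   # small variance, larger n
group2 = rng.normal(loc=11, scale=4, size=15)   # large variance, smaller n

# Student's t-test assumes equal variances; Welch's does not
print(stats.ttest_ind(group1, group2, equal_var=True))   # classic t-test
print(stats.ttest_ind(group1, group2, equal_var=False))  # Welch-Satterthwaite
```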
Also, in regression analysis the transformation is sometimes important: linear least squares regression assumes that the relationship between the variables is linear. Often we can "straighten" a nonlinear relationship by transforming one or more of the variables. This URL will be helpful for you.
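As a sketch of this "straightening", suppose y grows roughly exponentially in x; regressing log(y) on x recovers an approximately linear relationship. The data are made up and statsmodels is assumed to be available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100)
y = np.exp(0.5 * x) * rng.lognormal(0, 0.1, size=100)   # curved, multiplicative noise

X = sm.add_constant(x)
fit_raw = sm.OLS(y, X).fit()          # linear fit to the curved relationship
fit_log = sm.OLS(np.log(y), X).fit()  # linear fit after log-transforming y

# (R^2 values are on different response scales; this is only a rough illustration
#  that the log-scale relationship is approximately linear)
print("R^2 untransformed:", round(fit_raw.rsquared, 3))
print("R^2 after log(y): ", round(fit_log.rsquared, 3))
```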
1.1. Left tail (left skewed): square or cube the variable. This raises the variable values to a power greater than 1.
1.2. Right tail (right skewed): take the square root, log, or reciprocal of the variable values.
2.0 To run that in SPSS: Transform > Compute variable... > follow the on-screen instructions to transform the variable.
3.0 If these transformations fail to achieve normality, opt for the Box-Cox transformation, which uses a lambda value (see the sketch after this list). It's not a straightforward data transformation, but that should be your last resort. Find it here: http://pareonline.net/pdf/v15n12.pdf
4.0 If everything else fails, consider non-parametric tests.
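A minimal Python sketch of the Box-Cox transformation mentioned in point 3.0, using scipy (which estimates lambda by maximum likelihood); the data are made up and must be strictly positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=1, sigma=0.8, size=300)   # made-up, right-skewed, positive

x_bc, lam = stats.boxcox(x)   # transformed data and the estimated lambda
print("Estimated lambda:", round(lam, 2))
print("Skewness before:", round(stats.skew(x), 2),
      "after:", round(stats.skew(x_bc), 2))
```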
When some variables are nearly normally distributed and others show much wider spreads in their values, it is sometimes necessary to take their log to bring them closer to normality. If the differences in values are not so wide, one can take square roots or cube roots of the data. However, sometimes the transformation is dictated by the functional form, as in the Cobb–Douglas production function.
Rihab,
Apart from the good suggestions by Bruce and others, a transformation of the data is sometimes required when, on the basis of several variables, one wants to calculate a cumulative index to represent some construct or concept. For that, the data should be additive. Since the data are generally in different metrics (standard measurement units), they cannot simply be added. Therefore, to make them additive, they are transformed so that they become scale-free. There are a number of methods for making data scale-free; most commonly the z-score method is used, which transforms variables so that their means are zero and their variances are one.
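The z-score transformation described above is a one-liner; a small sketch with two made-up variables measured on different scales:

```python
import numpy as np

income = np.array([25_000, 40_000, 60_000, 120_000])    # in dollars
years_education = np.array([10, 12, 16, 20])             # in years

def zscore(v):
    """Center to mean 0 and scale to unit variance (sample SD)."""
    return (v - v.mean()) / v.std(ddof=1)

# Both variables are now scale-free and can be averaged into a crude composite index
composite = (zscore(income) + zscore(years_education)) / 2
print(composite)
```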
I agree with Bruce's answer when transformations are intended to approximate a normal distribution of the variable or residuals in a linear model. However, I think that there are deep reasons for transforming data in many circumstances not related to normality of the variables. Most statistical methods and models are designed for real data: this means that their values may be positive or negative (the whole real line), that the linear operation between variables is the sum, that scaling is multiplying by positive constants, and that differences are computed by ordinary subtraction, i.e. the data are assumed to have an absolute scale. This is the structure of the sample space. The main reason for transforming a random variable and/or the sample values is to make the transformed values compatible with the implicit assumptions of the statistical analysis of real data and its sample space.
One of the simplest and most frequent cases is that of positive variables on a ratio scale: the zero value is not attainable, and differences are measured by ratios, e.g. 2 is double 1 (2/1 = 2), but 1000 and 1001 are considered almost equal (1001/1000 ≈ 1), although the ordinary (Euclidean) difference is equal to 1 in both cases. In such cases, taking logs of the data can be seen as a change from the ratio scale to the absolute scale. And this is done independently of whether the resulting values appear to be normally distributed.
As a last comment, when the sample space of a variable is limited (positive, or an interval), it is theoretically impossible for that variable to be normally distributed, even when the normal distribution may be a good approximation of the true distribution. Transformations like the log for positive data, or the logit for proportion data, may not make the distribution exactly normal, but at least they make normality possible.
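A small sketch of the two transformations mentioned here: the log maps positive data onto the whole real line (and turns equal ratios into equal differences), and the logit does the same for proportions in (0, 1). The numbers are made up.

```python
import numpy as np
from scipy.special import logit

positive_data = np.array([0.5, 1.0, 2.0, 1000.0, 1001.0])
proportions   = np.array([0.02, 0.10, 0.50, 0.90, 0.98])

# log: ratio scale -> absolute scale; equal ratios become equal differences
print(np.log(positive_data))
print(np.log(2.0) - np.log(1.0), "vs", np.log(1001.0) - np.log(1000.0))

# logit: (0, 1) -> whole real line, so normality is at least possible
print(logit(proportions))
```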
I would like to insist on the final comment of Egozcue: when the variable has a constrained support, i.e. it is restricted to be positive or has to lie in an interval, then it is impossible for it to follow a normal distribution, because the support of the normal distribution is the whole real line, going from minus infinity to plus infinity.
It goes even further than what Vera says, if you believe what George Box said in section 2.5 (Role of Mathematics in Science) in his classic article, Science and Statistics (see link below). Here's the excerpt I have in mind (with emphasis added).
In applying mathematics to subjects such as physics or statistics we make tentative assumptions about the real world which we know are false but which we believe may be useful nonetheless. The physicist knows that particles have mass and yet certain results, approximating what really happens, may be derived from the assumption that they do not. Equally, the statistician knows, for example, that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.
That is a good reference, Bruce! Nevertheless, even when accepting an only approximate solution, it is good that at least the other assumptions hold. I mean the assumptions involved in taking the real line as the sample space, and in particular the absolute scale when dealing with real random variables. Transformations like the log help to turn, e.g., a ratio scale into an absolute scale.
"So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust."
If, for example, the p-value of my model is not significant (one of the most important assumptions of linear regression), can I use Bob's argument to justify the use of a log transformation?
Quick note on application: I think the power transformation (i.e., Box-Cox) should be mentioned. There are plenty of readily available sources on this.
I think your original question is an important one, and there are many good answers above. I'm afraid I can't contribute much more useful information as to why the analyst would transform data (assuming it's the response variable that's being transformed), but I will offer a glance at some alternative procedures for when the analyst suspects the assumptions of a "classic" (OLS-GLM) test have been unreasonably violated. Some have already alluded to alternatives to transformation (see below), so I hope this is still within the scope of the original question/information sought.
e.g., Robert's comment which was highlighted by Bruce (my emphasis), "So when we see that we are not meeting the assumptions of a given test or procedure, and the problem would appear to be the distribution of a variable we are using, then we often try transformations, although alternatively we can try a different test or procedure that might have different assumptions or be more robust."
e.g., Another comment from Robert, "... it might make sense to use a test that say converts the raw values to ranks (as many nonparametric tests do) and sidesteps some of the problems that a skewed distribution may be causing with some parametric test..."
I want to first reiterate some of Robert's comments: transformations are typically used to satisfy an assumption (or assumptions) of a statistical test—assuming we've been referring to classical tests based on ordinary least squares (e.g., ANOVA). As Bruce mentioned, we should be making these assumptions about the residuals of a fitted model. Namely, observations should be independent (i.e., no autocorrelation or pseudoreplication), homoskedastic (i.e., [near] equal variance), and normally (Gaussian) distributed. Transformations really can't help if your data aren't independent; that depends more on the design of the experiment and the sampling scheme. What they can do, as Bruce and Robert mention, is help meet the assumptions of homoskedasticity and normality. One thing to keep in mind when transforming is that interpretation, graphing, and reporting of results may not be as straightforward.
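As a companion to the residual-normality sketch earlier in the thread, here is a rough Python sketch of checking the other two assumptions named above — independence and homoskedasticity — on the residuals of a fitted OLS model (statsmodels assumed; the data are made up and deliberately heteroskedastic):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1 + 0.5 * x + rng.normal(0, 0.2 * x)   # error SD grows with x (heteroskedastic)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Homoskedasticity: Breusch-Pagan test on the residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", round(lm_pvalue, 4))

# Independence (serial correlation): Durbin-Watson statistic (values near 2 suggest none)
print("Durbin-Watson:", round(durbin_watson(fit.resid), 2))
```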
If you, the analyst, fear that assumptions have been unreasonably violated and you choose not to transform (or transforming will not improve the analysis), you might consider the following (not an exhaustive list, just what I am aware of; a rough sketch of one of these alternatives appears after the list):
Non-independence:
- Autocovariate
- Repeated-measures analysis
- Mixed-model framework
- Permutation (sometimes)

Heteroskedasticity:
- Corrections to denominator degrees of freedom (e.g., Welch's, Satterthwaite, Kenward-Roger)
- Generalized linear models (if you can approximate a known/common distribution, e.g., lognormal, Poisson)

Non-normally distributed:
- Generalized linear models (same as above)
- Permutation, bootstrapping
- Nonparametric approaches
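To give a flavour of the alternatives above, here is a rough Python sketch of a permutation test for a difference in means, which makes no distributional assumption about the errors (a Poisson or other GLM via statsmodels' GLM would follow a similar fit-and-check pattern); the data are made up:

```python
import numpy as np

rng = np.random.default_rng(5)
group_a = rng.exponential(scale=2.0, size=30)   # skewed, made-up data
group_b = rng.exponential(scale=3.0, size=30)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

# Permutation test: shuffle group labels and rebuild the null distribution
n_perm = 10_000
null = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)
    null[i] = perm[30:].mean() - perm[:30].mean()

p_value = np.mean(np.abs(null) >= abs(observed))
print("Observed difference:", round(observed, 2), " permutation p =", round(p_value, 4))
```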
You'll notice that some of the alternatives fall into multiple categories. Really, many of them could, depending on which assumption(s) are violated and the degree of the violation. You should realize, too, that these all come with different or additional assumptions and sometimes different philosophies. You'll also need to gain the technical know-how to implement and evaluate any of the above methods. It can be worth it, though.
I encourage others to add, improve, or critique the above, because I know it's a lot (each of those bullet points has enough literature to keep the analyst busy for some time!) and I may have neglected or poorly explained something.
At the risk of sounding redundant and ambiguous, know your data. Exploratory data analysis does not get enough attention. I think Zuur et al. 2010 is a good resource to have a look at. http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2009.00001.x/abstract
Cheers,
Caleb
2018-10-03 edit:
David DesRochers' comment (below) made me think to add a paper (likely of interest to thread readers).
Article What's normal anyway? Residual plots are more telling than s...
While this conversation is an older one, I am finding all of these contributions amazingly helpful. I am creating an introduction-to-data lecture for my research methods course, and I find this conversation thread to be an amazing resource. I find the theory behind the offerings very helpful as well. Thank you greatly!