You say your data set is highly skewed. Do you mean that the dependent variable in your model is highly skewed?
What is the dependent variable? (Depending on what it is, it may be conventional to apply some kind of transformation. E.g., IIRC, concentrations are often log-transformed in chemistry.)
What is the sample size?
How many explanatory variables do you have?
Are you specifically interested in conditional means (of the DV)? Or would you be happy to have conditional medians (for example) as fitted values from your model?
If you have not already done so, I suggest that you estimate the model using OLS, save the residuals, and examine residual plots. Remember that for OLS, it is the error distribution (not the Y-distribution) that is assumed to be normally distributed. Remember too that normality of the errors is a sufficient condition, but not a necessary condition. The necessary condition is that the sampling distributions of the parameter estimates be (approximately) normal--and they will converge on the normal distribution as n increases, even if the errors are not normally distributed. (See the attached PDF for a nice summary of the assumptions for OLS models.) HTH.
What kind of response variable is it? If it is rates or proportions, you can use Beta regression or Unit-Lindley regression. For other continuous variables, one way to deal with skewed data is a log transformation. If that does not work, you can select a power transformation (the Box-Cox procedure helps you choose a suitable power). If you decide to use an alternative to linear regression, Theil's SLR is a nonparametric alternative to simple linear regression. In addition, you can also use some modern nonparametric regressions: linear smoothers, local regression, and penalized regressions.
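As an illustration of Box-Cox power selection, here is a minimal Python sketch (scipy assumed; the data are simulated, not the poster's). For lognormal data the estimated power should land near zero, i.e., roughly a log transform.

```python
# Sketch of power-transformation selection via Box-Cox, as mentioned above.
# scipy.stats.boxcox estimates the lambda that best normalizes positive data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.8, size=500)  # right-skewed, positive

y_bc, lam = stats.boxcox(y)  # lambda near 0 here -> roughly a log transform
print(f"estimated lambda: {lam:.2f}")

# Skewness should shrink substantially after the transformation.
print(stats.skew(y), stats.skew(y_bc))
```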
In establishment survey sampling, it is common to have skewed dependent data, and because essential heteroscedasticity is a natural occurrence in regression when predicted-y values differ, weighted least squares regression is then very appropriate. The latest updates in each of these two projects show some examples:
https://www.researchgate.net/project/Cutoff-and-quasi-cutoff-sampling-with-prediction-for-Official-Statistics
and there are other examples. See other updates and references for those two projects.
In such cases, ratio models are most likely encountered, though I used multiple regression to handle cases of fuel switching in electric power plants, and in another case where there were multiple considerations. However, as long as you can write Yi = predicted-yi + ei, and you have heteroscedasticity which is often obvious with skewed data, then the situation is similar. You model using regression weights, which are not only part of the variance structure, but also involved in the estimated regression coefficients.
Log transformations, like other power transformations, change the distribution. Can there be problems? See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/#:~:text=The%20log%20transformation%20is%2C%20arguably,normal%20or%20near%20normal%20distribution.
There are other methods to answer your research questions as well.
One last thing: Multiple regression assumes normality of the residuals, not of the variables themselves. Scatterplots and examining multivariate normality can help. White's test is also possible.
We have discussed many times on ResearchGate that the data do not need to be "normally" distributed. It is somewhat desirable for the estimated residuals to be close to "normal." Also the examples I gave in the projects noted above show that skewed data are not a problem. I worked with them for decades.
However, many may try transformations to handle heteroscedasticity. As I also noted above, weighted least squares (WLS) regression where the regression weights in the model handle this makes such transformations undesirable, as I noted in an update to one of the projects listed. Transformations may not do a great job taking out the heteroscedasticity, and they can muddy the interpretation of results. I even once saw a transformation that had been done on some real estate data, as I recall, which still needed WLS regression, just with a different coefficient of heteroscedasticity. So the transformation was pointless.
As I noted above, there are examples in those projects. The most recent updates have scatterplots for them.
Essential heteroscedasticity is caused by the varying size of predicted-y values. With skewed data, heteroscedasticity may be more obvious because there will generally be a wide range of predicted-y values. If heteroscedasticity is the issue, then WLS regression is all the more appropriate.
Note: https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity, as determined by Ken Brewer.
Generalized linear models are specifically designed to accommodate non-normal distributions: lognormal, gamma, Poisson, binomial. This is the bread and butter of modern statistics.
My dependent variable is vocational identity (measured with a scale). The independent variables are family environment and life skills. The Shapiro-Wilk tests for these three variables show that there is no normality.
The predictors don't need to be normally distributed. Select a generalized linear model that matches your dependent variable, which sounds closer to categorical than continuous. So perhaps a multinomial model.
Hello Samson Ayyamperumal. As others have said, neither the explanatory variables nor the dependent variable need to be normally distributed. As I said in my first reply, it is the errors that are assumed to be normal, but it is the sampling distributions of the parameter estimates that really need to be approximately normal.
But you need to tell us more about the vocational identity scale you are using. Please provide a reference or some other resource that explains how it is calculated. And bear in mind that fitted values from OLS regression models are conditional means. Therefore, it would only be sensible to use OLS regression if it is sensible and defensible to use means and SDs for descriptive purposes. HTH.
Bruce Weaver Why do you limit your discussion to OLS and normal theory when we have all the other distributions available through a generalized linear models?
Hello John W. Kern. I am not at all opposed to using a GzLM of some sort if it is appropriate. But I also think we do not yet know enough about Samson's dependent variable (a vocational identity scale) to make any good recommendations. YMMV.
You could first try log-transforming your data. It really depends on the distribution of your dependent variable. If you are using the Stata software package, you can run a quick check across candidate transformations to determine which one brings the variable closest to a normal distribution.
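A rough Python analogue of that kind of transformation check (an assumed workflow, not Stata output; the data are simulated): apply candidate power transformations and compare Shapiro-Wilk W statistics, where a larger W means closer to normal.

```python
# "Ladder of powers"-style comparison of candidate transformations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.lognormal(sigma=1.0, size=300)  # skewed, positive example data

candidates = {
    "identity": y,
    "sqrt": np.sqrt(y),
    "log": np.log(y),
    "1/sqrt": 1.0 / np.sqrt(y),
    "inverse": 1.0 / y,
}
for name, t in candidates.items():
    w, p = stats.shapiro(t)
    print(f"{name:>8}: W = {w:.4f}")
# For lognormal data, the log transform should score highest.
```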
Samson noted that he wanted to know the "...non-parametric version of multiple regression," and said "My data set is highly skewed." - If the y-data are skewed, as in the case of establishment survey data, and the parametric predicted-y are similarly skewed, there is no reason yet to think you have to abandon the usual multiple regression. One can use a "graphical residual analysis" (which can be researched on the internet) to examine model fit, and a "cross-validation" (similarly researched) to consider whether one has overfit the model to a particular sample.
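The two checks mentioned, graphical residual analysis and cross-validation, can be sketched as follows (scikit-learn assumed; simulated data and placeholder variable names, not the poster's data):

```python
# Sketch: residual inspection plus K-fold cross-validation to guard
# against overfitting, with deliberately skewed predictors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.exponential(size=(200, 2))  # skewed explanatory variables
y = 2.0 + X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
resid = y - model.predict(X)
# Plot resid against model.predict(X) to look for curvature or
# funnel shapes (heteroscedasticity) -- the "graphical residual analysis".

scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())  # stable R^2 across folds suggests no overfitting
```

Skewed X and y do not by themselves invalidate the ordinary multiple regression here; the diagnostics, not the marginal distributions, are what matter.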
Considering what Samson said in the first of the March 5 responses above regarding his data, they need to be meaningfully measurable.
A response variable transformation here is not necessary and often degrades interpretation. It is good to have the estimated residuals close to "normally" distributed, not the response variable.
Some have asked: Why "mess" with the distribution? That only makes sense if one believes there is one true distribution and it was the one chosen. When dealing with constructs and ordinal numbers, one makes decisions that might or might not be most representative. For example, on a 1-5 scale, the difference between a 3 and a 4 might be major while a 1 versus a 2 is virtually the same, etc. Again, one can overfit data, but one can also underfit data. There is a simple exercise: does the model describe and predict what it intends to? One can test this finding (and depending on sample size, number of variables, etc., power will range from low to high). Simpler is often better, and comparison of models is easy with modern technology.
True, in some situations, transformations can distort--like a physical variable with a true 0. Meanwhile, this original poster is in social science research, where his variables are arbitrary and are not ratio-scaled. A 1-5 scale is used by convention with little thought about what each point even means. It could just as arbitrarily be anything else, from dichotomous to quadratic and anything in between.
But with count data containing zeros, one sometimes adds 1 to everything before a log transformation. If the resultant model predicts well, then one can live with the limitations.
There is not one easy, take-it-or-leave-it answer.
Hopefully everyone knows normality is about the residuals of the model, not the predictor or response variables themselves. I don't think anyone has made that argument for quite a while; I know I did not. The original poster said his data set is highly skewed, which sounds like everything. In that case, transformation is on the table.