Need help with determining linear equation and goodness of fit (R-squared) for a dataset with non normal distribution?

What is the response (dependent) variable, and what is the predictor (independent) variable? The most important question: is a linear relationship between these two variables reasonable? Further: are there other covariates to be considered (this is often the case in observational studies, where the observational units, typically patients, vary not only in the predictor of interest)?

If the response variable is not normally distributed, then it is also extremely unlikely that the relationship to a predictor is linear. Typically, non-normally distributed response variables are modelled by generalized linear models that specify a more reasonable distribution model (e.g. binomial, Poisson, gamma, ...) together with a link function that determines the form of the relationship between the predictor and the response (linkFn(E(Y)) = a + b·X).
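For readers who prefer to see this written out, here is the usual textbook form of such a model. It is only a sketch in standard notation; the symbols g, β0 and β1 do not come from the post above.

```latex
g\bigl(\mathrm{E}[Y \mid X]\bigr) = \beta_0 + \beta_1 X
\qquad\Longleftrightarrow\qquad
\mathrm{E}[Y \mid X] = g^{-1}(\beta_0 + \beta_1 X)

% Poisson response with log link:
\mathrm{E}[Y \mid X] = \exp(\beta_0 + \beta_1 X)

% Binomial response with logit link:
\mathrm{E}[Y \mid X] = \frac{1}{1 + \exp\bigl(-(\beta_0 + \beta_1 X)\bigr)}
```

In both examples the linear part stays linear in X, and the link function supplies the curvature of the expected response.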

Also, for non-normally distributed variables, R² is not defined. Some software calculates "pseudo-R²" values, but these are even more difficult to interpret than R². It is much more instructive to give the estimated coefficients (e.g. the slope of the regression line) and (if prediction is the aim) the average standard error of the prediction and of the residuals.

Often in medical data there is a hell of a lot of noise, so clever thoughts about reasonable distribution models and functional relationships are simply a waste of time, because there is not enough specifically useful information in the data anyway (in a scatter plot of the data you see some kind of fuzzy cloud without a clear pattern). In such cases the best information you may extract is whether the response trends upwards or downwards, at least over the given range of predictor values. Here a simple linear regression is OK. However, I would not dare to report or interpret r or R². These are quite useless measures, particularly when the range of predictor values is not fixed experimentally. Better to report the estimated slope of the regression line instead, which is far easier to interpret. If there are no covariates involved (which could be handled in a multiple regression or a general linear model), you may also test the rank correlation (Spearman's rho) to see if the data allow you to distinguish a positive from a negative monotone (not necessarily linear) association. The drawback here is that there is no slope estimate, so you can only say that the expected response increases (or decreases) with increasing values of the predictor, but you cannot say by how much (and that might be important for judging the relevance of the association).
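To make the two options concrete, here is a minimal sketch in plain Python that computes the least-squares slope and Spearman's rho for the same data. The data and the function names (`ols_slope`, `pearson_r`, `average_ranks`, `spearman_rho`) are invented for illustration and are not taken from the question.

```python
from math import sqrt
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up data: monotone increasing but clearly not linear.
dose = [1, 2, 3, 4, 5, 6, 7, 8]
response = [1.0, 1.5, 2.1, 3.5, 5.2, 9.0, 14.8, 30.1]

print("slope (response units per unit dose):", ols_slope(dose, response))
print("Spearman's rho:", spearman_rho(dose, response))
```

Because the invented response rises strictly with the dose, rho comes out as 1, which says "monotone increasing" and nothing about how steep the increase is. The slope carries the "by how much" information, in the units of the response.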

Hasan Issa Mirza

Randomness is an absolute necessity for generalizing research results in the best possible way.

Francis C. Dane

As indicated by Christian Geiser, linear regression does not require the variables to be normally distributed; only the residuals are required to be normally distributed. However, Pearson correlation does require the variables to be normally distributed, which is why R-squared from regression may not agree with r-squared from correlation. If you are interested only in whether or not the two variables exhibit an association, I suggest you use the Spearman correlation coefficient. If you are interested in predicting one variable from the other, ordinary least squares regression is probably OK but, again as suggested by Christian Geiser, check the distribution of the residuals for normality.
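One way to check the residuals is a normal Q-Q comparison. The sketch below, in plain Python with invented data, fits a straight line and pairs each sorted, standardised residual with the normal quantile expected at that position. The helper names `fit_line` and `qq_points` and the plotting positions (i + 0.5)/n are illustrative choices, not anything prescribed in this thread.

```python
from statistics import NormalDist, mean, stdev


def fit_line(x, y):
    """Ordinary least-squares intercept and slope."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def qq_points(residuals):
    """Pairs of (theoretical normal quantile, standardised residual)."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    # Standardising only rescales the residuals for easier reading.
    observed = sorted((e - centre) / spread for e in residuals)
    expected = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.6, 4.8, 6.3, 6.9, 9.8, 8.7, 10.4, 13.5]

intercept, slope = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

for expected, observed in qq_points(residuals):
    print(f"{expected:6.2f}  {observed:6.2f}")
```

If the residuals are roughly normal, the two columns should track each other closely; plotting one against the other gives the usual Q-Q plot, where strong curvature at the ends points to skewness or heavy tails.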

Thom Baguley

I don't understand this. If I correlate x and y and get .30 then the standardised slope of the regression between x and y will be equal to .30 also. The R^2 from the regression is .09 and equal to the square of the correlation. This applies regardless of whether x and y are non-normal. If you get different answers something has gone wrong.

More generally the correlation coefficient makes no assumption about normality at all. The assumption of normality applies to the statistical model used to derive inferences and hypothesis tests (e.g., p values).
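These identities are easy to check numerically. Here is a minimal sketch in plain Python using invented, deliberately right-skewed data; all variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Made-up, right-skewed data for illustration only.
x = [0.2, 0.5, 0.9, 1.4, 2.0, 3.1, 4.8, 9.5]
y = [1.1, 0.7, 2.4, 1.9, 2.2, 5.0, 3.9, 8.3]

mx, my = mean(x), mean(y)
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)   # total sum of squares of y
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)             # Pearson correlation
slope = sxy / sxx                     # least-squares slope of y on x
intercept = my - slope * mx

# Slope after rescaling x and y to unit standard deviation.
standardised_slope = slope * sqrt(sxx / syy)

# Residual sum of squares of the fitted line.
rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
r_squared = 1 - rss / syy

print("r:", r)
print("standardised slope:", standardised_slope)
print("r squared:", r ** 2)
print("R^2 from regression:", r_squared)
```

The first two printed values agree, and so do the last two, up to floating-point rounding. Nothing in the calculation refers to the shape of the distributions of x or y.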

Artyom Bannikov

As Thom Baguley wrote, linear regression does not make any assumptions about normality. Simple linear regression can be derived by the least-squares method, i.e. by minimizing the squared vertical distances between the data points and the regression line. No normality, nor any statistics at all, is involved. But the p-values that software outputs would be invalid without the normality assumption: the p-value computation does rely on normality of the errors.

The coefficient of determination R^2 is 1 - RSS/TSS, where RSS is the residual sum of squares and TSS is the total sum of squares. So R^2 is interpretable without normality as well: it is the share of the variation that is explained by the model.

The next point might be arguable. Do you need normality of errors at all? If your linear regression model is the final product, then, generally, yes. There is no other way to show that the model makes sense. But if the results of your model can be validated in some other way, this may be different. For example, a parameter is estimated by linear regression and is then used in a formula, and you have experimental data points that can be correlated with your estimates. In this case, imo, the normality assumption isn't that important at all. The main thing is that you have a useful result.

James R Knaub

Graphics, as Christian Geiser noted, are very useful. You could research the term "graphical residual analysis." But that applies to the sample at hand, and you do not want to overfit to that. See "cross-validation." Also, you should not assume OLS, as heteroscedasticity in regression is natural. See https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR.
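As a rough illustration of the cross-validation idea, here is a minimal leave-one-out sketch in plain Python. It uses an unweighted straight-line fit purely for brevity, which is an assumption of this sketch and not a recommendation; a weighted fit could be substituted where heteroscedasticity matters. The data and the names `fit_line`, `in_sample_mse` and `loocv_mse` are invented.

```python
from statistics import mean


def fit_line(x, y):
    """Unweighted least-squares intercept and slope."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def in_sample_mse(x, y):
    """Mean squared residual when fitting and checking on the same data."""
    intercept, slope = fit_line(x, y)
    return mean((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def loocv_mse(x, y):
    """Mean squared prediction error, leaving out one point at a time."""
    errors = []
    for i in range(len(x)):
        # Refit without point i, then predict point i.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        errors.append((y[i] - (intercept + slope * x[i])) ** 2)
    return mean(errors)


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.8, 4.4, 5.1, 8.9, 9.2, 13.5, 12.1, 18.0]

print("in-sample MSE:    ", in_sample_mse(x, y))
print("leave-one-out MSE:", loocv_mse(x, y))
```

The leave-one-out figure is never smaller than the in-sample one for a least-squares line, and the size of the gap gives a rough sense of how optimistic the in-sample residuals are.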

Also, yes it is often the case that you can use linear regression, even straight line linear regression, with skewed or otherwise non-normal data. For example, see https://www.researchgate.net/publication/362370770_Application_of_Efficient_Sampling_with_Prediction_for_Skewed_Data.

(Note: Data are normally not "normal.")
