You seem to be mixing up relationships, error models, and linearity in the relationship and in the predictors...
OLS = ordinary least squares. This is a criterion for the stochastic part of the model (the error model) to identify a "good fit": the values of the coefficients are chosen so that they minimize the "sum of squared errors". This relates only to the properties of the stochastic part, not to the functional part of the model. This means that the relationship between your variables can have any form (linear, quadratic, exponential, sinusoid, logarithmic, logistic, etc.) - that does not matter. What matters is that the probability distribution of the response, conditional on the values of the predictor[s], is approximately normal: y ~ N( E(Y|X), s² ). Then the OLS solution happens to give those coefficient values for which the observed data are most likely (given your model) (this is the "maximum likelihood solution", of which OLS is just a special case).
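As a toy sketch of this equivalence (made-up straight-line data; intercept and error variance held fixed at their true values to keep it one-dimensional), a grid over candidate slopes shows that the slope minimizing the sum of squared errors is exactly the slope maximizing the normal likelihood:

```python
import numpy as np

# Made-up data: straight-line relationship with normal errors (sigma = 1)
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, x.size)

# Evaluate SSE and the normal log-likelihood over a grid of candidate slopes
slopes = np.linspace(1.0, 3.0, 401)
resid = y - 1.0 - slopes[:, None] * x          # intercept fixed at its true value
sse = (resid ** 2).sum(axis=1)
loglik = (-0.5 * resid ** 2 - 0.5 * np.log(2 * np.pi)).sum(axis=1)  # sigma = 1

# The SSE minimizer and the likelihood maximizer are the same slope
print(slopes[sse.argmin()], slopes[loglik.argmax()])
```

Since the log-likelihood here is just a constant minus half the SSE, the two criteria must pick the same slope.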
The solution ("fit") can be found by some relatively simple matrix algebra - provided the coefficients are not transformed or placed inside some non-linear function. Models in which the coefficients appear only in this "linear" form are called linear models. Linear models can model non-linear relationships between variables; only the coefficients must not sit inside some non-linear term. So
y = bx + a
is linear, because the coefficients a and b are both "linear". This model also models a straight-line (linear) relationship between x and y. The models
y = cx² + bx + a
y = b*exp(x) + ax
y = c*sin(x) + bx³ + a
are also linear models, but they do not model linear relationships. Finally, the models
y = sqrt(bx) + exp(x/a)
y = b*sin(ax)
are examples of non-linear models.
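To illustrate the point above, here is a sketch (with made-up data and coefficient values) of fitting the quadratic model by plain matrix algebra - the model is linear in the coefficients even though the modeled relationship is a curve:

```python
import numpy as np

# Made-up data from y = 2x² - x + 0.5 plus small normal noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5 + rng.normal(0, 0.1, x.size)

# Design matrix: one column per coefficient (a, b, c)
X = np.column_stack([np.ones_like(x), x, x**2])

# OLS solution from the normal equations: beta = (X'X)^-1 X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
a_hat, b_hat, c_hat = beta
print(a_hat, b_hat, c_hat)
```

The estimates land close to the true values 0.5, -1.0, 2.0, found in one step of matrix algebra with no iterative search.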
The solution of such non-linear models may also be the OLS solution, only that it cannot be found by matrix algebra and usually has to be searched for (using some more or less clever algorithm).
When the conditional probability distribution of the response is not approximately normal, the OLS solution does not give the coefficient values for which the observed data would be most likely. Here the likelihood function of the data has to be determined more or less explicitly to find its maximum (= the combination of coefficient values for which the data are most likely). The "shortcut" via the minimization of the "sum of squared errors" is not possible anymore. Again, this is not related to the functional relationship between the variables. As an example, think of the number of radioactive decay events over time: the expected number increases linearly with time, but the counts show a Poisson distribution (which is not normal). So the functional part of the model is linear and the stochastic part is Poisson.
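A sketch of that decay example (rate and time points made up): for the model E(Y|t) = b*t with Poisson counts, the likelihood maximum happens to have the closed form sum(y)/sum(t), which generally differs from the OLS estimate of b:

```python
import numpy as np

# Made-up decay counts: expected count grows linearly, E(Y|t) = 3*t,
# but the counts themselves are Poisson, not normal
rng = np.random.default_rng(2)
t = np.arange(1, 51)
y = rng.poisson(3.0 * t)

# OLS through the origin: minimizes the sum of squared errors
b_ols = (t * y).sum() / (t * t).sum()

# Poisson maximum likelihood: maximize sum(y*log(b*t) - b*t); setting the
# derivative to zero gives b_ml = sum(y) / sum(t)
b_ml = y.sum() / t.sum()
print(b_ols, b_ml)
```

Both estimates land near the true rate 3.0 here, but only the second one is the maximum likelihood solution under the Poisson error model.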
The maximum likelihood solution will converge to the OLS solution as n -> Inf. Therefore, if your sample size is large, OLS may give you a useful approximate solution even in cases where the conditional probability distribution of the response is not normal.
Hi, thanks for your answer. I am really grateful. I am a bit confused now. Please see the attached file, in which I have shown the exact issue. I hope you can help me out.
Using the normal error model for this data is a bit optimistic. It is clear from the plot that the response is not normally distributed (the distribution at each "boardsizew" seems right-skewed). I have no idea what "absoluteofdiscretionaryaccrww" is, so I can't guess what distribution would be sensible to assume.
Whatever the relationship is, the noise in your data is considerably larger than any systematic trend, so there won't be much you can infer about the relationship, no matter what model you choose.
The r²=0.0000 does not indicate that the relationship is "not linear". It just says: using a linear relationship to model the data and fitting the coefficients by minimizing the sum of squared residuals, the model is able to explain less than 0.005% of the variance of the observed values.
Using your model (the linear functional model with assumed normal errors), the slope of the regression line is estimated to be 0.003 (per "boardsizeww"), and your data is considered "not too unexpected" for similar models with slopes between -0.0090 and +0.0095 (as given by the 95% confidence interval for "boardsizew"). If the model specification is more or less sensible, the next step would be to think about whether this range covers some relevant value, or whether such slopes are completely irrelevant.
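For illustration only (made-up, noise-only data, not your file): this is how a near-zero r² and a slope confidence interval come out of an ordinary OLS fit when the response is unrelated to the predictor.

```python
import numpy as np

# Made-up data: a predictor with few distinct levels and a response that
# is pure noise, i.e. unrelated to the predictor
rng = np.random.default_rng(3)
x = np.repeat(np.arange(4.0, 16.0), 30)
y = rng.normal(5.0, 2.0, x.size)

# OLS fit and r² = 1 - SS_residual / SS_total
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# 95% CI for the slope from the usual OLS standard error (normal approx.)
n = x.size
s2 = (resid @ resid) / (n - 2)
se_slope = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())
ci = (beta[1] - 1.96 * se_slope, beta[1] + 1.96 * se_slope)
print(round(r2, 4), ci)
```

The tiny r² says only that the line explains almost none of the variance; the interval around the slope is what tells you which effect sizes are still compatible with the data.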
OLS regression can still be used to test many kinds of curvilinear relationships, if you add polynomial terms. However, as Jochen mentioned, your data might be problematic for other reasons; the outcome variable looks skewed, and the predictor variable has very few levels. You might want to treat the predictor as ordinal rather than continuous, and you also might want to do bootstrap regression rather than OLS.
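A sketch of both suggestions with made-up data: a polynomial term added to the design matrix so OLS can pick up a curvilinear trend, and a case-resampling bootstrap for the coefficient interval instead of normal-theory standard errors (with real data the design matrix would use your own predictor columns).

```python
import numpy as np

# Made-up data: a curved relationship with skewed (gamma) errors
rng = np.random.default_rng(4)
x = rng.integers(4, 15, 200).astype(float)
y = 0.5 * (x - 9) ** 2 + rng.gamma(2.0, 2.0, x.size)

# Design matrix with a polynomial (x²) term; still a linear model
X = np.column_stack([np.ones_like(x), x, x**2])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Case-resampling bootstrap: refit on resampled rows, collect coefficients
boot = []
for _ in range(1000):
    idx = rng.integers(0, x.size, x.size)
    boot.append(ols(X[idx], y[idx]))
boot = np.array(boot)

# Percentile interval for the curvature (the x² coefficient)
lo, hi = np.percentile(boot[:, 2], [2.5, 97.5])
print(lo, hi)
```

The percentile interval needs no normality assumption about the errors, which is the point of bootstrapping here.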
Thanks everyone for your valued input. Sergiy, please find attached my data. x14, x15 and x17 are dummy variables coded either 1 or 0. I hope you will be able to help. Looking forward.
Well, I made scatterplots of your data, but beyond that I don't know what more I can do to help. You haven't specified which variable you are interested in, and you have 18 variables of different types with different issues (e.g., some are categorical and some are not; some have normal distributions and some depart seriously from normality) and thus will need to be treated in different ways. If you don't state which specific variable you are interested in, you probably won't get any specific suggestions.
But in any case, inferential statistics should not be done without specific predictions anyway. If you just wanted to explore a lot of variables and visualize their effects, you have already done that by looking at the scatterplots. If you didn't have clear predictions about the pattern of effects, and you then see from the scatterplot that a pair of variables shows a "rather random" relationship, then the proper conclusion is that you didn't find a significant relationship; it wouldn't be honest to hack the statistics to make a relationship appear.
Well, it would be difficult to explain the hypothesis here, but I have just figured out that I made a mistake in quantifying a few variables, and some of them are normally distributed.