As I am not an expert in statistics, I have some doubts regarding multiple regression analysis.
To predict E, I use four different independent variables (e.g., A, B, C, D), and the results show that A, B, and C have significant coefficients (at the 95% confidence level) while D has an insignificant coefficient. The coefficient values for A, B, C, and D are 0.87, 3.62, -0.35, and 144.6 respectively. What does this mean? In this case, should I leave D out of the multiple regression equation?
What is the meaning of significant or insignificant constant values?
Does the negative coefficient value of C mean that C has a negative impact on E, or something else? And why is D showing such a high coefficient value?
I think originally you said two of those independent variables were "soil moisture" and "rainfall." Those could overlap in the information they provide. When this happens there is collinearity. (This generally happens to some extent; it is a matter of degree.) Using more independent variables than needed generally adds to variance, and can possibly lead to coefficients with a different sign than would occur if that independent variable were used alone. Leaving out variables with needed information, however, can create "omitted variable bias." (Some may solve this using "principal components," a method which may help with that problem at the expense of interpretability.)
I suggest you experiment, as I think you previously said you did, with different sets of independent variables, and see which ones have lower variances of prediction errors. You could withhold some known dependent-variable data, predict them using the main bulk of the data, and compare the performances of competing models. That is, using the different combinations of independent-variable data, which came closer to estimating ("predicting") the withheld y-values? You should also consider what makes sense given your subject matter. Does it fit subject-matter theory? Otherwise, spurious results can sometimes occur. Can the results be repeated?
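For concreteness, here is a minimal sketch of that hold-out comparison in Python. The DataFrame, the file name "my_data.csv", and the use of scikit-learn are my assumptions for illustration; the column names A, B, C, D, E follow your question.

```python
# Hold-out comparison of candidate variable sets (hypothetical data file and columns).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("my_data.csv")                 # hypothetical file with columns A, B, C, D, E

train, test = train_test_split(df, test_size=0.2, random_state=0)

candidate_sets = [["A", "B", "C", "D"],         # full model
                  ["A", "B", "C"],              # drop D
                  ["A", "B"]]                   # an even smaller model

for cols in candidate_sets:
    model = LinearRegression().fit(train[cols], train["E"])
    pred = model.predict(test[cols])
    rmse = mean_squared_error(test["E"], pred) ** 0.5
    print(cols, "hold-out RMSE:", round(rmse, 3))
```

Whichever set of regressors gives the smaller hold-out error, and also makes sense for your subject matter, would be the stronger candidate.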
Graphical studies, comparing variables in scatterplots, may help you decide which variables and models to use when considering continuous data.
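A scatterplot matrix is one quick way to do that; a sketch, again assuming the hypothetical DataFrame and column names from above:

```python
# Pairwise scatterplots of the regressors and the outcome (hypothetical data file and columns).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("my_data.csv")                 # hypothetical file, as before
pd.plotting.scatter_matrix(df[["A", "B", "C", "D", "E"]], figsize=(8, 8))
plt.show()
```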
"Significance" is a misleading and incomplete concept. Don't worry too much about that. Even relative values may be misleading here due to interaction between your 'independent' variables.
Regarding intercept values, consider the situation where every independent variable is zero. Would you expect a nonzero value for the dependent variable in that situation? If not, you should probably drop the intercept term, thus setting it to zero. I suspect that in such a case, the standard error for such an intercept would have been relatively small.
An intercept may, however, serve to partially correct for missing variables.
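If you want to compare the two choices directly, a small statsmodels sketch (hypothetical DataFrame and column names, as before) is below; the "const" row of the first summary shows the intercept estimate and its standard error.

```python
# Fit with and without an intercept and compare (hypothetical data file and columns).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("my_data.csv")                           # hypothetical file, as before
X = df[["A", "B", "C", "D"]]

with_const = sm.OLS(df["E"], sm.add_constant(X)).fit()    # intercept estimated
no_const = sm.OLS(df["E"], X).fit()                       # intercept forced to zero

print(with_const.summary())   # check the "const" row: coefficient and std err
print(no_const.summary())
```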
- Experiment.
Note that there are different types of regression models appropriate to different types of data (say, continuous or count data, etc.) and to linear or nonlinear relationships. Also, what has been found to help in problems similar to yours may be of interest. You showed me a version of this question that named the variables. If you provide that information again, then perhaps someone experienced with your subject may have more ideas here.
Best wishes - Jim
PS - If I remember correctly, the first draft of your question made me think you wanted to know why some coefficients were much bigger than others, and I wondered whether you were considering the units used. That is, if one coefficient is about ten times another, but the values of that variable are generally only about a tenth as large, then the two represent about the same contribution to the value of y in a multiple linear regression. At any rate, various interactions between variables, as noted earlier, "muddy the waters" anyway.
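One hedged way to make the magnitudes comparable is to standardize each regressor before fitting, so each coefficient is per one standard deviation of that variable rather than per raw unit (again using the hypothetical DataFrame and column names):

```python
# Coefficients on standardized (z-scored) regressors are more directly comparable.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("my_data.csv")                 # hypothetical file, as before
X = df[["A", "B", "C", "D"]]
X_std = (X - X.mean()) / X.std()                # z-score each regressor

std_fit = sm.OLS(df["E"], sm.add_constant(X_std)).fit()
print(std_fit.params)                           # change in E per one SD of each regressor
```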
Note again that eliminating coefficients based on p-values may oversimplify. There may be substantial interactions, such as collinearity, between independent variables. Also note, backward and forward elimination can be too arbitrary.
You could pick your model based on performance and subject matter relevance, as previously noted. For performance, hold out some known y-values, as noted, and use them to compare to predicted values to validate models. (There is an area of "statistical learning" known as "cross-validation" you may want to study.) Also, the estimated square root of the variance of the prediction error, which is also impacted by bias, can be used for comparisons. This could be found, for example, in SAS PROC REG as STDI. Note also you may have nonlinearity, as also noted above, and scatterplots may help you decide that.
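If you do not have SAS at hand, a cross-validation sketch with scikit-learn would look roughly like this (hypothetical DataFrame and column names, as before):

```python
# Compare candidate models by cross-validated out-of-sample RMSE (hypothetical data).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("my_data.csv")                 # hypothetical file, as before

for cols in [["A", "B", "C", "D"], ["A", "B", "C"]]:
    scores = cross_val_score(LinearRegression(), df[cols], df["E"],
                             cv=5, scoring="neg_root_mean_squared_error")
    print(cols, "mean CV RMSE:", round(-scores.mean(), 3))
```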
Thanks a lot for the response, but I did not understand the answer to the last question (Does the negative coefficient value of C mean that C has a negative impact on E, or something else?).
If you are saying it is appropriate to use multiple linear regression with your data and variables, and you have a negative regression coefficient where you expected a positive one, that could be because of collinearity. You could check by using only that one regressor to see if it is positively correlated with the dependent variable. If it is positively correlated, yet the negative coefficient occurs in multiple linear regression, then you probably have substantial collinearity, and it is likely particularly inappropriate to use that p-value method of selection. The influences of that regressor and the others are muddled. (There are also other ways for the independent variables to interact to the detriment of a straightforward analysis.)
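A small sketch of that sign check, assuming the same hypothetical DataFrame and column names as before:

```python
# Compare the sign of C's coefficient alone versus in the full model (hypothetical data).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("my_data.csv")                 # hypothetical file, as before

simple = sm.OLS(df["E"], sm.add_constant(df[["C"]])).fit()
full = sm.OLS(df["E"], sm.add_constant(df[["A", "B", "C", "D"]])).fit()

print("C alone:        ", simple.params["C"])
print("C in full model:", full.params["C"])
# A sign flip between the two fits points to substantial collinearity.
```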
It is better to test the performances of competing models with test data.
So ... ideally, the coefficient sign would indicate the direction of the relationship with the dependent variable. In simple linear regression it does. For multiple linear regression, it might not. Collinearity can substantially impact a coefficient, and in the worst cases, even make it change signs.
(On another topic, note that this kind of regression is for continuous data. For other types of data, other types of regression may be appropriate.)
"Does negative coefficient value of C depict, C has negative impact on E or something else? I assume you are doing a linear regression". A negative coefficient for C means that that for a unit increase in C (e.g. if C is continuous such as age than for one year increase in age) your outcome E decreases with the value of the coefficient (when all other variables remain constant).
"And why D is showing such a high coefficient value?" You should not be extremely bothered about D is it is insignificant (p>0.05 if this is the cutoff you are using). For unexpected high coefficients it is a good practice to look into that independent variable to see the number of cases, it's distribution, etc. Regarding retaining D in your regression model, this depends on what you want to control for. If you want to study the influence of D on E than you should keep it even if it's insignificant. Don't forget about doing other model fit statistics to test your model. I hope this helps!
Thank you. Hope I did not miss anything important.
A final suggestion: If you have "soil moisture" and "rainfall" both as regressors/independent variables, might moisture cover the same information provided by rainfall, and more? What if you dropped rainfall? I don't really know about this. I hope an expert in your field responds to your question. I'm just guessing about this.
Everything is covered in your explanation, and yes, you are right: soil moisture is directly related to rainfall. But soil moisture is estimated in different soil layers, which show different impacts on evapotranspiration.
Do a principal component analysis to minimize the collinearity among the predictors. The collinearity among the predictors will reduce the real significance of each predictor.
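A minimal principal-component-regression sketch with scikit-learn (hypothetical DataFrame and column names, as in the earlier sketches); the components are uncorrelated by construction, which removes the collinearity, though at the cost of interpretability:

```python
# Principal component regression: scale, project onto components, then regress (hypothetical data).
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("my_data.csv")                 # hypothetical file, as before
X, y = df[["A", "B", "C", "D"]], df["E"]

pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("training R^2:", round(pcr.score(X, y), 3))
```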
As far as I know, when you are fitting a linear regression model, there are some things you need to keep in mind. Firstly, you need to find out whether your regressors/independent variables are intercorrelated. Once you determine that, you need to use univariate analysis to examine the level of relationship between each independent variable and your outcome. Any that are highly associated with your outcome should make it into your final model. Having fitted your model, you can use the VIF (Variance Inflation Factor) to examine the multicollinearity among your independent variables. This is in addition to James's and Vishal's comments.
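A sketch of the VIF check with statsmodels (hypothetical DataFrame and column names, as before); as a rough rule of thumb, VIFs well above about 5-10 are often taken to indicate problematic collinearity:

```python
# Variance Inflation Factor for each regressor (hypothetical data file and columns).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("my_data.csv")                 # hypothetical file, as before
X = sm.add_constant(df[["A", "B", "C", "D"]])   # include a constant so the VIFs are meaningful

for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))
```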
Recall that in my first response, I included this:
"(Some may solve this using "principle components," a method which may help with that problem at the expense of interpretability.)"
Things can get complicated, but the bottom line is performance. So I don't think I can stress enough the usefulness of validation by withheld data for testing.
Cheers - Jim
PS - You might see if you can also get your question listed under a topic that indicates your subject matter. That way you may also have subject matter experts "weigh in." I'd like to hear from them myself. I have some other things preoccupying me, and hope announcements from this thread don't get buried in my email, but this sounds like an interesting application, and I'd like to see what subject matter experts have to say.