Significance of variables in a multiple regression model


The output of a multiple regression contains the intercept, the parameter estimates (coefficients, or betas), the t and F values, R² and the tests of significance (p-values). In R, they can be displayed with functions such as summary(), coef() and lm.beta(). From these statistics and coefficients, we try to identify the variable with the highest significance in the model.
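As a rough illustration of where these quantities live in R, here is a minimal sketch. It assumes the built-in mtcars data as a stand-in for the real data set (the variables mpg, wt, hp and qsec are not from the question), and it computes standardized coefficients by rescaling in base R rather than calling lm.beta(), so no extra package is needed.

```r
# Assumption: mtcars stands in for the asker's data; variable names are illustrative.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: estimates, standard errors, t values and p-values.
coef_table <- coef(summary(fit))
print(coef_table)

# The overall F statistic and R-squared are stored in the summary object.
fit_summary <- summary(fit)
print(fit_summary$fstatistic)
print(fit_summary$r.squared)

# Standardized (beta) coefficients in base R:
# refit after scaling the response and every predictor to unit variance.
vars <- c("mpg", "wt", "hp", "qsec")
scaled <- as.data.frame(scale(mtcars[, vars]))
fit_std <- lm(mpg ~ wt + hp + qsec, data = scaled)
std_beta <- coef(fit_std)[-1]  # drop the (essentially zero) intercept
print(std_beta)
```

The standardized coefficients put predictors on a common scale, but they are still affected by correlation among the predictors.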

Variable importance in the model is mostly indicated by R² and the p-values. Variables with marginal or low significance have p-values above the chosen significance threshold (for instance, p = 0.05), and including or excluding them barely changes the percentage of variance explained by the model (confidence intervals can be used to be more precise).
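A minimal sketch of that check, again assuming mtcars as placeholder data and qsec as the (hypothetical) candidate for removal: it prints confidence intervals for the coefficients and the change in R² when the variable is dropped.

```r
# Assumption: mtcars and the choice of qsec are illustrative only.
fit_full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_reduced <- update(fit_full, . ~ . - qsec)

# 95% confidence intervals for every coefficient in the full model.
ci <- confint(fit_full, level = 0.95)
print(ci)

# How much explained variance is lost by dropping qsec.
r2_change <- summary(fit_full)$r.squared - summary(fit_reduced)$r.squared
print(r2_change)
```

An interval that comfortably includes zero together with a negligible change in R² is the situation described above.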

Insignificant variables are often eliminated from the model using backward, forward and stepwise selection procedures.
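For reference, these procedures can be sketched with step() from base R. Note that step() selects by AIC rather than by p-values, so this is a related technique and not exactly the p-value-based elimination described above. The data and the candidate predictors are assumptions for illustration.

```r
# Assumption: mtcars and these four predictors are placeholders.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward elimination: start from the full model and drop terms.
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms.
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Stepwise: terms may be added or dropped at each step.
both <- step(null, scope = formula(full), direction = "both", trace = 0)

print(formula(backward))
print(formula(forward))
print(formula(both))
```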

The standardized coefficients and their corresponding p-values may also provide a standardized way to compare the effects of independent variables that are measured in different units. Nevertheless, because the independent variables are usually correlated, we need to find a more robust variable importance analysis, such as dominance analysis, elastic net, random forest or Boruta, to determine the actual importance of an independent variable. How do we choose a variable importance selection criterion?

James R Knaub

Job -

You are correct that because of influences between predictors, notably collinearity, you cannot simply compare p-values. In fact, p-values cannot be used to throw out or keep a variable either without at least considering effect size and sample size, and that is still problematic. The "importance" of each predictor depends on what other predictors are used, and how, and the best predicted-y comes from the best set of predictors used the best way. Hopefully principal components won't be necessary for good prediction, as then any chance for explanation will be virtually gone. (See Galit Shmueli: https://www.researchgate.net/publication/48178170_To_Explain_or_to_Predict. The example in the appendix may be particularly instructive.)
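The collinearity point can be seen in a small simulation. This is only a sketch on made-up data (x1, x2, y and all numeric settings are assumptions): y depends on x1 alone, yet adding a near-copy of x1 inflates the standard error of x1 and so changes its p-value.

```r
# Assumption: simulated data, not from the thread.
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
y <- 1 + 2 * x1 + rnorm(n)     # y depends on x1 only

fit_alone <- lm(y ~ x1)
fit_both <- lm(y ~ x1 + x2)

# Standard error of the x1 coefficient with and without x2 in the model.
se_alone <- coef(summary(fit_alone))["x1", "Std. Error"]
se_both <- coef(summary(fit_both))["x1", "Std. Error"]
print(c(alone = se_alone, with_x2 = se_both))

# Variance inflation factor for x1, computed by hand.
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
print(vif_x1)
```

The same predictor looks far less certain once its near-duplicate is in the model, which is why comparing p-values across correlated predictors says little about importance.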

I suggest a "graphical residual analysis" to study and compare model fit for a given sample, and a "cross-validation" to help avoid fitting so closely to the sample that prediction for the remainder of the population or subpopulation which you are trying to model may be too degraded. The graphical residual analysis may also indicate heteroscedasticity. See https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression, with various updates in reverse chronological order.
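One way these two checks might look in base R is sketched below. The data (mtcars), the two candidate formulas, the number of folds and the helper cv_rmse() are all assumptions made for illustration, not part of the linked material.

```r
# Assumption: mtcars and both candidate models are placeholders.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Graphical residual analysis: residuals against predicted y.
# A fan shape (spread growing with predicted y) suggests heteroscedasticity.
plot(fitted(fit), resid(fit), xlab = "Predicted y", ylab = "Residual")
abline(h = 0, lty = 2)

# k-fold cross-validation: root mean squared prediction error on held-out rows.
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  sq_err <- numeric(nrow(data))
  for (i in sort(unique(folds))) {
    held_out <- folds == i
    m <- lm(formula, data = data[!held_out, ])
    pred <- predict(m, newdata = data[held_out, ])
    sq_err[held_out] <- (data[held_out, response] - pred)^2
  }
  sqrt(mean(sq_err))
}

set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Compare a simple model with a heavily parameterized one on the same folds.
rmse_small <- cv_rmse(mpg ~ wt + hp, mtcars, folds)
rmse_large <- cv_rmse(mpg ~ wt * hp * qsec * drat, mtcars, folds)
print(c(small = rmse_small, large = rmse_large))
```

A model that fits the sample closely but has a clearly larger cross-validated error is the over-fitting case described above.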

The R-square can be quite problematic. See https://data.library.virginia.edu/is-r-squared-useless/ where it is remarked that R-square isn't even a measure of fit. I think they say that because the estimated sigma part can be big and you may still have small standard errors for regression coefficients.
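A small simulation can illustrate that last remark. The data below are made up (the slope of 2, the noise level and the sample size are assumptions): the residual sigma is large, so R-square is low, yet the slope is estimated with a small standard error.

```r
# Assumption: simulated data chosen to make the effect visible.
set.seed(7)
n <- 10000
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 10)  # true slope 2, large residual sigma

fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared
sigma_hat <- summary(fit)$sigma
slope <- unname(coef(fit)["x"])
slope_se <- coef(summary(fit))["x", "Std. Error"]

print(c(r_squared = r2, sigma = sigma_hat, slope = slope, slope_se = slope_se))
```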

You mention forward and backward elimination, but that is unlikely to find the best set of predictors. You could research "model selection," but I think you should concentrate on what makes sense from a subject matter perspective. You can compare models for different samples using graphical residual analyses and cross-validations. An intercept term should only be used if it makes sense. If y should be zero when all predictors are zero, then there should be no intercept term.
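In R, a model without an intercept term is written by adding 0 to the formula. The sketch below uses simulated data (an assumption) where y really is zero when x is zero. Note that summary() computes R-square differently for no-intercept models, so the two R-square values should not be compared with each other.

```r
# Assumption: simulated data that truly pass through the origin.
set.seed(3)
x <- runif(50, min = 1, max = 10)
y <- 3 * x + rnorm(50)

fit_origin <- lm(y ~ 0 + x)  # regression through the origin
fit_int <- lm(y ~ x)         # same model with an intercept term

print(coef(fit_origin))
print(coef(fit_int))
```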

So I suggest dropping consideration of p-values, R-square, and forward and backward elimination. I suggest using graphical residual analyses and cross-validations, and considering your subject matter.

As for "...variable importance selection criteria...," the idea is to find the best set of predictors used in the best way. It is said that unnecessary complexity generally increases variance, and oversimplification generally increases bias. But those two comments assume you know you have sampled well enough to cover the population or subpopulation of interest.

Best wishes - Jim

Proloy Barua

I usually select a model as follows:

1. Include independent and dependent variables in the model based on an extensive review of the literature.

2. Compare successive models with a goodness-of-fit test.

3. Finally, select the best model and the predictors of the outcome (dependent) variable.
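Step 2 could be sketched in R as follows. The nested F test and AIC are one possible choice of comparison (the answer does not say which test is meant), and the data and candidate models are placeholders.

```r
# Assumption: mtcars and these three nested models are illustrative.
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# F tests between successive nested models.
nested_test <- anova(m1, m2, m3)
print(nested_test)

# Information criterion for each candidate (lower is better).
ic <- AIC(m1, m2, m3)
print(ic)
best <- rownames(ic)[which.min(ic$AIC)]
print(best)
```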

Xinhai Li

If you want to compare the importance of independent variables, you can use anova(lm(Y~X1*X2+I(X1^2)+I(X2^2))). It gives you the sum of squares of every variable/term, which is how much of the variance of Y each one explains. There are also other indices of importance, such as contribution, fraction [a], and partial R-square. The sum of squares is the most popular one. Please remember to include interaction terms and quadratic terms in the initial full model (as I did above), and then do model selection to remove nonsignificant variables/terms.
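A runnable version of that call might look like the sketch below, with mtcars variables standing in for Y, X1 and X2 (an assumption). One caveat worth knowing: anova() on a single lm fit reports sequential sums of squares, so the value for a term depends on the order in which correlated terms enter the formula.

```r
# Assumption: mpg, wt and hp play the roles of Y, X1 and X2.
full <- lm(mpg ~ wt * hp + I(wt^2) + I(hp^2), data = mtcars)
tab <- anova(full)
print(tab)

# Share of the total sum of squares attributed to each term (and residuals).
share <- tab[["Sum Sq"]] / sum(tab[["Sum Sq"]])
names(share) <- rownames(tab)
print(round(share, 3))

# Order matters: the sum of squares for wt entered first versus last.
ss_wt_first <- anova(lm(mpg ~ wt + hp, data = mtcars))["wt", "Sum Sq"]
ss_wt_last <- anova(lm(mpg ~ hp + wt, data = mtcars))["wt", "Sum Sq"]
print(c(first = ss_wt_first, last = ss_wt_last))
```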
