I want to see whether my independent variables predict my dependent variable. My study is exploratory. I cannot find a simple answer online as to whether to use stepwise or standard multiple regression (the Enter method) in SPSS.
Megan Wood A typical multiple regression shows you the variance explained by all of the predictors entered into the model at once. Stepwise regression is used to see how the variance explained, R², changes as predictors are added to (or removed from) the model one at a time. In short, stepwise regression helps you assess the relative importance of each predictor and answers the question, "Does my model do a significantly better job of predicting the outcome variable when I add (or remove) particular predictors?"
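Since SPSS is menu-driven, here is a minimal sketch in Python (statsmodels) of the two ideas above: a standard Enter-style model with all predictors entered at once, and the change in R² when a single predictor is added. The data and variable names (x1, x2, x3, y) are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)  # x3 contributes nothing

# Standard ("Enter") regression: all IVs entered in one block.
X_full = sm.add_constant(np.column_stack([x1, x2, x3]))
full = sm.OLS(y, X_full).fit()
print("Full-model R^2:", round(full.rsquared, 3))

# Stepwise-style question: how much does R^2 change when x2 is added
# to a model that already contains x1?
base = sm.OLS(y, sm.add_constant(x1)).fit()
plus = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("R^2 change from adding x2:", round(plus.rsquared - base.rsquared, 3))
```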
Megan Wood, stepwise regression is just garbage. See the excellent discussion in:
Harrell, F. E. (2001). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer-Verlag.
Miller, A. J. (2002). Subset Selection in Regression. London: Chapman & Hall.
In concert with Ronaldo Gonzales' post, I too would urge you to avoid any of the "step" methods when arriving at a regression model. Generally, if you're conducting an "exploratory" analysis and had some defensible reason for considering a given set of IVs in the first place, it's difficult to see why you wouldn't evaluate a full model as your starting point. Yes, subsequent inspection and cross-validation might cause you to revise it, but step methods won't necessarily give you what you're after. Here's why:
1. Step methods are very opportunistic, and the resultant models may not be stable across samples (let alone hold for the population). Hence, generalizability is a concern, and validation samples are a must; the simulation sketch after this list illustrates the instability.
2. There is no assurance that step methods will arrive at the "best" ensemble of IVs for a given DV, regardless of your criterion for "best."
3. Minor adjustments to the variable entry/deletion criteria can affect the performance of step methods in unpredictable ways.
4. The internal significance tests are evaluated incorrectly in a number of software packages (e.g., the tests are frequently too liberal).
5. Step methods will frequently omit variables which could help the model's performance, due to phenomena such as the suppressor effect.
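To illustrate point 1, here is a small simulation sketch in Python: a simple hand-rolled forward selection (entry criterion p < .05, which is one of many possible rules) is run on bootstrap resamples of the same data set, and it frequently picks different predictor sets each time. All names and parameter values are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=n)  # 2 real signals

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection: repeatedly add the candidate predictor
    with the smallest p-value, as long as that p-value is below alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]  # p-value of the candidate column
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return tuple(sorted(selected))

chosen = set()
for _ in range(20):
    idx = rng.integers(0, n, size=n)  # bootstrap resample of the rows
    chosen.add(forward_select(X[idx], y[idx]))
print(f"{len(chosen)} distinct models chosen across 20 resamples")
```

On data like these, the selected model typically varies across resamples even though the underlying population model never changes, which is exactly the generalizability concern raised above.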
Dear Bruce, BTW, there's no such thing as "penalized stepwise regression." Unless you mean stepwise regression itself, which has a least-squares penalty if you want to call it that, and that usage is unknown in my experience. The Austin and Tu reference above shows that stepwise is not reproducible. Please 🙏🙏🙏 finally read the Austin and Tu reference. David Booth
Dear David, I used that wording so that you would not think I was including LASSO in the list of methods I was decrying. (I may have misunderstood, but I thought you believed I was doing so in another thread a little while ago.)
You seem to believe that I endorse the use of one or more of the classical stepwise methods. I don't know where you got that idea. To be clear, I do not support the use of any of the following variable selection methods:
Stepwise
Forward selection
Backward elimination
All possible subsets (see Frank Harrell's comments in the Stata FAQ on stepwise regression--I posted the link earlier in the thread)
Bruce, I am tired of arguing with you. IMO, and in that of other statisticians, the Austin and Tu reference I suggested above shows that stepwise and its variants are not reproducible and hence of zero value to science. In addition, the penalty factor is defined in many places, so I will leave that to you. However, I believe the term was first used in reference to ridge regression, not stepwise. I refer you to the early work showing that there is a relationship between ridge regression and Bayesian statistics, and to the elastic net/glmnet literature as well. You can find a brief introduction to these in Efron and Hastie's Computer Age Statistical Inference, published in 2016, I believe. Best wishes for a good day. David Booth
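For readers following along, a minimal sketch of the penalized-regression idea being discussed: ridge, lasso, and elastic net all minimize the least-squares criterion plus a penalty on the size of the coefficients (L2 for ridge, L1 for lasso, a blend for elastic net). This uses scikit-learn rather than glmnet; the data and the alpha/l1_ratio values are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=100)

for model in (Ridge(alpha=1.0),        # penalty: alpha * ||beta||_2^2
              Lasso(alpha=0.1),        # penalty: alpha * ||beta||_1
              ElasticNet(alpha=0.1, l1_ratio=0.5)):  # blend of both
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```

Note how the L1 penalty shrinks some coefficients exactly to zero, which is why LASSO does variable selection in a principled, reproducible way, unlike the step methods criticized above.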
David, I too am tired of this. But I do have one final suggestion. Please direct me to any specific statements or conclusions in the Austin and Tu article that I have contradicted. Thank you.
Hi all, just another quick question if anyone is able to help. In a similar paper, the authors entered only the independent variables that correlated significantly with the DV in preliminary correlation tests. I have run a multiple regression for one DV with all of my IVs, which is non-significant. When I run the multiple regression with just the significantly correlated IVs, the model is significant. I'm unsure which is the right way. I have run all of the assumption checks before running the regression, e.g., visual inspection of a linear relationship between each IV and the DV, but I'm unsure whether each relationship actually needs to be significant as per the p-value.
One further comment. You said, "I have run all of the assumption checks *before* running the regression..." (emphasis added). Given that the major assumptions are about the errors, you cannot check them until after you have estimated the model.
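A minimal sketch of that point in Python (statsmodels/scipy): the key assumptions (normality and constant variance of the errors) are checked on the residuals, which only exist once the model has been fitted. The data are synthetic, and these two tests are just common examples, not an exhaustive assumption check.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(150, 3)))
y = X @ np.array([1.0, 0.5, 0.3, 0.2]) + rng.normal(size=150)

fit = sm.OLS(y, X).fit()
resid = fit.resid  # the residuals only exist after estimation

# Normality of the residuals (Shapiro-Wilk).
print("Shapiro-Wilk p-value:", round(stats.shapiro(resid).pvalue, 3))

# Constant error variance (Breusch-Pagan).
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", round(lm_p, 3))
```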
Stepwise tries to eliminate variables that are highly correlated among themselves, so it's better than standard regression, where variables with a low contribution are kept.
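For what it's worth, if the concern is correlated predictors, that overlap can be quantified directly in a standard (Enter) model with variance inflation factors, rather than relying on stepwise to weed variables out. A minimal sketch with statsmodels; the data are synthetic, with x2 built to be nearly collinear with x1.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor (skipping the constant); values well above 10
# are a common rule-of-thumb flag for problematic collinearity.
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))
```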