Stepwise methods seem like the answer to the problem of having a lot of possible predictors and not knowing which ones to put into your model. But that not knowing is the real problem: the variables in the model should be specified from your hypothesis, which in turn should build on previous research. Collecting data without a planned model is like shopping without a recipe in mind: you end up with half the ingredients needed for half a dozen recipes. Science is the same.
If you don’t have a hypothesised model and you’ve gone ahead and accumulated data anyway, there are still very important reasons for not letting stepwise methods act as a substitute for a theory and a hypothesis. Briefly, these are:
1. The p-values for the variables in a stepwise model do not have the interpretation you think they do. It is hard to define what hypothesis they actually test, or the chance that they represent false-positive or false-negative findings (a small simulation sketch at the end of this answer illustrates the point).
2. The variables selected may not be the best subset of variables either; there may be other equally good, or even better, combinations. One simple solution is to test all possible subsets of variables. And, like all simple solutions to complex problems, it's wrong: you end up with an unreproducible, atheoretical model that has sacrificed any generalisability to the one task you gave it, which was fitting a particular sample of data.
3. The overall model fit statistics are wrong. The adjusted R² is too large, and if many candidate variables did not make it into the final model, it will be a massive overestimate. R² should be adjusted for the number of variables entered into the selection process, not just the number actually selected.
4. Stepwise models produce unreproducible results. A different dataset will most likely give a different model, and a stepwise model built on one dataset will fit a new dataset badly.
5. Most importantly, stepwise models break a fundamental assumption of statistics: that the model is specified in advance and only then are the coefficients calculated from the data. If you allow the data to specify the model as well as the coefficients, all bets are off. See the Stata FAQ:
I can do no better than quote Kelvyn Jones, a geography researcher significant enough to have his own Wikipedia page: "There is no escaping of the need to think; you just cannot press a button."
Essentially, stepwise methods break the first rule of data analysis:
The software should work; the analyst should think.
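To make points 1, 3, and 5 above concrete, here is a minimal simulation sketch (assuming Python with numpy and statsmodels installed; the forward-selection loop is a simplified stand-in for what stepwise software does). Run on pure noise, it typically "selects" several predictors with small p-values and a non-trivial R², even though the outcome is unrelated to every candidate variable.

```python
# Forward selection by p-value on PURE NOISE (illustrative sketch, not a
# recommended procedure). The outcome y is unrelated to every predictor,
# yet the selected "final model" usually reports small p-values and a
# respectable R-squared.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, p, alpha = 100, 50, 0.05              # 100 cases, 50 candidate predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                   # outcome is pure noise

selected = []
while True:
    best_p, best_j = alpha, None
    for j in range(p):
        if j in selected:
            continue
        cols = selected + [j]
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        if fit.pvalues[-1] < best_p:     # p-value of the newly entered variable
            best_p, best_j = fit.pvalues[-1], j
    if best_j is None:                   # no remaining variable passes the threshold
        break
    selected.append(best_j)

final = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
print(f"{len(selected)} 'significant' predictors selected from pure noise")
print(f"R-squared of the final model: {final.rsquared:.2f}")
print("p-values of selected variables:", np.round(final.pvalues[1:], 4))
```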
I agree with Ronán Michael Conroy's comments on the classical "stepwise" methods. But note that newer methods involving shrinkage, or penalization for model complexity, are now available (e.g., the LASSO). You can read a bit about them here, for example:
Bruce Weaver, the fundamental problem with all automated variable selection procedures is that they build a model for which you will have to invent a theory!
I am in general agreement with you, Ronán Michael Conroy, particularly when regression is being used for "causal analysis", as Paul Allison describes it in this blog post:
For purely predictive modeling, on the other hand, I might be more willing to give LASSO a try. Would you consider using it in that context? Or are you generally opposed to it in any context?
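For readers who want to try the penalized approach discussed above, here is a minimal sketch (assuming Python with scikit-learn; the data and variable names are purely illustrative). LassoCV chooses the penalty strength by cross-validation and shrinks uninformative coefficients toward, and often exactly to, zero, rather than making in/out decisions through repeated significance tests.

```python
# Minimal LASSO sketch (assumes Python with numpy and scikit-learn installed).
# The data are simulated purely for illustration: only the first two
# predictors carry any signal.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # two real signals, 28 noise variables

X_std = StandardScaler().fit_transform(X)                 # penalization needs predictors on a common scale
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

kept = np.flatnonzero(lasso.coef_ != 0)
print(f"Penalty chosen by cross-validation: {lasso.alpha_:.3f}")
print(f"Indices of non-zero coefficients: {kept.tolist()}")
```

Note that the cross-validated penalty targets predictive error; as the discussion above stresses, it does not remove the need for a theory about which variables belong in the model.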