Stepwise methods seem like the answer to the problem of having a lot of possible predictors and not knowing which ones to put into your model. But that not knowing is the real problem: the variables in the model should be specified from your hypothesis, which in turn should build on previous research. Collecting data without a planned model is like shopping without a recipe in mind: you end up with half the ingredients needed for half a dozen recipes. Science is the same.
If you don’t have a hypothesised model and you’ve gone ahead and accumulated data anyway, there are still very important reasons for not letting stepwise methods act as a substitute for a theory and a hypothesis. Briefly, these are:
1. The p-values for the variables in a stepwise model do not have the interpretation you think they do. It is hard to define what hypothesis they actually test, or the chance that they represent false-positive or false-negative findings (a small simulation sketch at the end of this answer illustrates the point).
2. The variables selected may not be the best subset of variables either; there may be other equally good, or even better, combinations. One simple solution is to test all possible subsets of variables. And, like all simple solutions to complex problems, it's wrong: you end up with an unreproducible, atheoretical model that has sacrificed any generalisability to the one task you gave it, which was fitting a particular sample of data.
3. The overall model fit statistics are wrong. The adjusted R² is too large, and if many candidate variables did not make it into the final model, it will be a massive overestimate. R² should be adjusted for the number of variables entered into the selection process, not just the number actually selected.
4. Stepwise models produce unreproducible results. A different dataset will most likely give a different model, and a stepwise model built on one dataset will fit a new dataset badly.
5. Most importantly, stepwise models break a fundamental assumption of statistics: that the model is specified in advance and only then are the coefficients calculated from the data. If you allow the data to specify the model as well as the coefficients, all bets are off. See the Stata FAQ:
I can do no better than quote Kelvyn Jones, a geography researcher significant enough to have his own Wikipedia page: "There is no escaping of the need to think; you just cannot press a button."
Essentially, stepwise methods break the first rule of data analysis:
The software should work; the analyst should think.
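To make points 1, 3, and 5 above concrete, here is a minimal simulation sketch (assuming Python with numpy and statsmodels installed; the forward-selection loop is a simplified stand-in for what stepwise software does). Run on pure noise, it typically "selects" several predictors with small p-values and a non-trivial R², even though the outcome is unrelated to every candidate variable.

```python
# Forward selection by p-value on PURE NOISE (illustrative sketch, not a
# recommended procedure). The outcome y is unrelated to every predictor,
# yet the selected "final model" usually reports small p-values and a
# respectable R-squared.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, p, alpha = 100, 50, 0.05              # 100 cases, 50 candidate predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                   # outcome is pure noise

selected = []
while True:
    best_p, best_j = alpha, None
    for j in range(p):
        if j in selected:
            continue
        cols = selected + [j]
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        if fit.pvalues[-1] < best_p:     # p-value of the newly entered variable
            best_p, best_j = fit.pvalues[-1], j
    if best_j is None:                   # no remaining variable passes the threshold
        break
    selected.append(best_j)

final = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
print(f"{len(selected)} 'significant' predictors selected from pure noise")
print(f"R-squared of the final model: {final.rsquared:.2f}")
print("p-values of selected variables:", np.round(final.pvalues[1:], 4))
```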
I agree with Ronán Michael Conroy's comments on the classical "stepwise" methods. But note that newer methods involving shrinkage, or penalization for model complexity, are now available (e.g., the LASSO). You can read a bit about them here, for example:
Bruce Weaver, the fundamental problem with all automated variable selection procedures is that they build a model for which you will have to invent a theory!
I am in general agreement with you, Ronán Michael Conroy, particularly when regression is being used for "causal analysis", as Paul Allison describes it in this blog post:
For purely predictive modeling, on the other hand, I might be more willing to give LASSO a try. Would you consider using it in that context? Or are you generally opposed to it in any context?
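For readers who want to try the penalized approach discussed above, here is a minimal sketch (assuming Python with scikit-learn; the data and variable names are purely illustrative). LassoCV chooses the penalty strength by cross-validation and shrinks uninformative coefficients toward, and often exactly to, zero, rather than making in/out decisions through repeated significance tests.

```python
# Minimal LASSO sketch (assumes Python with numpy and scikit-learn installed).
# The data are simulated purely for illustration: only the first two
# predictors carry any signal.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # two real signals, 28 noise variables

X_std = StandardScaler().fit_transform(X)                 # penalization needs predictors on a common scale
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

kept = np.flatnonzero(lasso.coef_ != 0)
print(f"Penalty chosen by cross-validation: {lasso.alpha_:.3f}")
print(f"Indices of non-zero coefficients: {kept.tolist()}")
```

Note that the cross-validated penalty targets predictive error; as the discussion above stresses, it does not remove the need for a theory about which variables belong in the model.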