Dear Professor Amin, thank you for your answer. Would you recommend some references for me to learn more about such factors? Also, would you suggest any other methodologies for multimodel selection (such as AIC/BIC, AICc, etc.)?
You can use stepwise regression (forward, backward, or stepwise) to select a model. The method can use r-squared, p-values, AIC, or possibly some others depending on the software. The problem is that if the issues Raid raised are present in your data, then you can end up at a local maximum. You get a model with an r-squared of 0.37 if you start with all of the variables, but if you remove "this one" so that the program cannot select it, then you get a better model (or maybe just a different one).
It also depends on how you want to define a better model and what constitutes better. Is a model with a p-value for the F test of 0.006 really better than one with a p-value of 0.007? Numerically, yes, but does this difference have any real impact on your interpretation of the data or the utility of your research? How does your interpretation change between the two "best" models?
You can also run the selection process several different ways, using several criteria (changing the value to enter and value to remove in a stepwise regression). How does the best model change depending on the method used for selection? If they all disagree, then you will need to figure out why (see Raid's answer for a good list of potential causes).
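If your software happens to be SAS, a minimal sketch of that experiment (assuming a response named y and predictors named x1-x10) would be something like:
proc reg;
* the entry and removal thresholds are the values you would vary between runs;
model y=x1-x10/selection=stepwise slentry=0.15 slstay=0.15;
run;
Switching to selection=forward or selection=backward, or changing slentry and slstay, shows you how sensitive the "best" model is to the selection rule.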
It may be appropriate to discuss several of the better models. This might be a great start to planning future research.
Dr. Ebert, thank you very much for the reply. The definition I am willing to use as a first approach for the best model is simply its "fitness", i.e., how well a given model fits the experimental data. For such an approach it seems reasonable to consider the residuals from a linear regression analysis. I am not considering the variables involved (yet), only the global results. Therefore, I am not performing any sort of multiple regression analysis. Also, I am considering only empirical models that are very well established in the literature.
You could try leaving out one or more data values at random, building the model, and then seeing how well the model predicts these "new" values. You could either rebuild the model each time using stepwise regression, or you could use the result from stepwise regression with all the data and just recalculate the model coefficients. You then take the difference between the "new" value and the predicted value, and repeat several thousand times (if you have enough data). This will give a somewhat different answer from just looking at the residuals, though the form of the output will still be observed-expected. This is a Jackknife procedure, very well established in the literature -- though typically the Jackknife method would be leaving out observations to see the effect on estimates of the regression coefficients (or some other statistic).
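If you work in SAS, one shortcut worth mentioning (not the random leave-several-out scheme above, just its leave-one-out cousin) is to ask PROC REG for the PRESS residuals directly, again assuming a response y and predictors x1-x10:
proc reg;
model y=x1-x10;
* press_r is y minus the prediction obtained when that observation is left out of the fit;
output out=loo press=press_r;
run;
Summing the squares of press_r gives the PRESS statistic; comparing it with the ordinary SSE indicates how much the fit degrades when each point has to be predicted rather than fitted.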
I can try to explain to you what I happen to be doing for model selection in regression, using SAS. You can use another software package, of course. This may be an imperfect methodology, but it works well for me.
1. Check for outliers and high influence data values.
proc reg;
model y=x1-x10/influence;
run;
I use the covariance ratio COVRATIO, R-Student, hat diagonals, etc. to assist me in identifying data values that can be removed.
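Common rule-of-thumb cutoffs (following Belsley, Kuh, and Welsch) for a model with p parameters and n observations are roughly
h_ii > 2p/n, |RSTUDENT| > 2, |DFFITS| > 2*sqrt(p/n), |COVRATIO - 1| > 3p/n.
I treat these as screening guides only; a flagged observation is examined, not automatically deleted.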
2. I check for multicollinearity.
proc reg;
model y=x1-x10/vif collin;
run;
Identify which variables should not be used in the same model. Use the results from steps 1 and 2 in the following steps.
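As a rough guide, the variance inflation factor for each predictor is
VIF_j = 1/(1 - R_j^2),
where R_j^2 comes from regressing x_j on the remaining predictors. VIF values above about 10 (some prefer 5) and condition indices above about 30 are the usual warning signs of serious multicollinearity.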
3. Model selection based on good fit first.
proc reg;
model y=x1-x10/selection=rsquare adjrsq mse cp;
run;
Mallows' Cp provides information on how to balance between underfitting (you get bias) and overfitting (you get large prediction variances).
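For reference, the usual definition is
C_p = SSE_p/MSE_full - (n - 2p),
where SSE_p is the error sum of squares of a candidate model with p parameters (including the intercept) and MSE_full is the mean squared error of the full model; a candidate with little bias has C_p close to p.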
I select a few candidate models that are best for fitting and that show promise to be good for prediction.
4. I obtain the PRESS statistic for each chosen model. Basically, it is a prediction error sum of squares that measures how stable the model is when predicting new data (the formula is given after the runs below).
proc reg; model y= x2 x4 x6/cli p; run;
proc reg; model y=x1 x2 x4 x5/cli p; run;
proc reg; model y=x1 x3 x4 x7/cli p; run;
proc reg; model y=x1-x10/cli p; run;
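For each model, PRESS is computed from the leave-one-out prediction errors:
PRESS = sum over i of (y_i - yhat_(i))^2,
where yhat_(i) is the prediction of the i-th observation from the model fitted without that observation. A PRESS value close to the SSE suggests the model will predict new data about as well as it fits the current data.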
5. Summarize all results in a summary table for the chosen models, plus the full model. You will be able to have more than one excellent model to choose from. I emphasize to students that each "x" is not just an x: there is a cost, there is measurement error, and there is its importance in the study, etc. Having alternate choices for models is very useful.
I recommend not using stepwise regression at all. Such model selection methods can be misleading, and they do not fully take multicollinearity into account. The F tests from stage to stage are not independent of each other.
The regression text by Montgomery, Peck, and Vining is excellent.
Douglas Montgomery and Geoff Vining were taught regression by Ray Myers at Virginia Tech, and so was I. The above steps were taught to us by Dr. Myers.
If you assess models simply using "fitness", it is reasonable to consider the residuals for regression models, linear or nonlinear. Then R^2 and SEE are the important statistics. In general, a nonlinear model is better than a linear one, because a linear model is only a special case of a nonlinear model. Besides R^2 and SEE, we suggested four other statistics for model assessment, i.e., TRE, ASE, MPE and MPSE. The expressions of the above six statistics can be found in the attached file. Hope it is helpful to you.
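For the first two, the usual definitions are
R^2 = 1 - SSE/SST and SEE = sqrt(SSE/(n - p)),
with p the number of fitted parameters; the expressions for the other four statistics are given in the attachment.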
Dr. Zeng, thank you for your reply. Would you mind sending me the references for the statistics you are suggesting? In the case I am working on, the models are linearized prior to regression, which allows me to use the statistics you are suggesting. Also, I was advised to use the AICc ranking for multimodel selection, which is based on information theory, not on regression statistics. Have you ever used such a criterion?
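As I understand it, for a least-squares fit with k estimated parameters and n observations,
AIC = n*ln(SSE/n) + 2k,
AICc = AIC + 2k(k + 1)/(n - k - 1),
and the candidate model with the smallest AICc is ranked best.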
The estimation of parameters for multivariate analysis or simultaneous equations can be based on an information criterion such as AIC, but for a specific model you have selected it is better to use other statistics, such as R^2, SEE, and MPE, to assess goodness of fit. You can use common statistical software such as SAS to estimate the parameters of an equation system, and then, based on the parameter estimates, the model assessment statistics can be calculated for each equation. One of my papers is attached for reference.
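As a rough sketch (assuming SAS/ETS is available, and not necessarily the exact procedure used in the attached paper), a two-equation system could be fitted jointly with seemingly unrelated regression and then assessed equation by equation; the variable names here are only placeholders:
proc syslin sur;
* fit both equations jointly so the cross-equation error correlation is used;
model y1=x1 x2;
model y2=x1 x3;
run;
The residuals from each equation then give the R^2, SEE, MPE, etc. for that equation.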