Dear Professor Amin, thank you for your answer. Would you recommend some references for me to learn more about such factors? Also, would you suggest any other methodologies for multimodel selection (such as AIC/BIC, AICc, etc.)?
You can use stepwise regression (forward, backward, or stepwise) to select a model. The method can use r-squared, p-values, AIC, or possibly some others depending on the software. The problem is that if the issues Raid raised are present in your data, then you can end up at a local maximum. You get a model with an r-squared of 0.37 if you start with all of the variables, but if you remove "this one" so that the program cannot select it, then you get a better model (or maybe just a different one).
It also depends on how you want to define a better model and what constitutes better. Is a model with a p-value for the F test of 0.006 really better than one with a p-value of 0.007? Numerically, yes, but does this difference have any real impact on your interpretation of the data or the utility of your research? How does your interpretation change between the two "best" models?
You can also run the selection process several different ways, using several criteria (changing the value to enter and value to remove in a stepwise regression). How does the best model change depending on the method used for selection? If they all disagree, then you will need to figure out why (see Raid's answer for a good list of potential causes).
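If your software happens to be SAS, a minimal sketch of that experiment (assuming a response named y and predictors named x1-x10) would be something like:
proc reg;
* the entry and removal thresholds are the values you would vary between runs;
model y=x1-x10/selection=stepwise slentry=0.15 slstay=0.15;
run;
Switching to selection=forward or selection=backward, or changing slentry and slstay, shows you how sensitive the "best" model is to the selection rule.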
It may be appropriate to discuss several of the better models. This might be a great start to planning future research.
Dr. Ebert, thank you very much for the reply. The definition I am willing to use as a first approach for the best model is simply its "fitness", i.e., how well a given model fits the experimental data. For such an approach it seems reasonable to consider the residuals from a linear regression analysis. I am not considering the variables involved (yet), only the global results. Therefore, I am not performing any sort of multiple regression analysis. Also, I am considering only empirical models that are very well established in the literature.
You could try leaving out one or more data values at random, building the model, and then seeing how well the model predicts these "new" values. You could either rebuild the model each time using stepwise regression, or you could use the result from stepwise regression with all the data and just recalculate the model coefficients. You then take the difference between the "new" value and the predicted value, and repeat several thousand times (if you have enough data). This will give a somewhat different answer from just looking at the residuals, though the form of the output will still be observed-expected. This is a Jackknife procedure, very well established in the literature -- though typically the Jackknife method would be leaving out observations to see the effect on estimates of the regression coefficients (or some other statistic).
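If you work in SAS, one shortcut worth mentioning (not the random leave-several-out scheme above, just its leave-one-out cousin) is to ask PROC REG for the PRESS residuals directly, again assuming a response y and predictors x1-x10:
proc reg;
model y=x1-x10;
* press_r is y minus the prediction obtained when that observation is left out of the fit;
output out=loo press=press_r;
run;
Summing the squares of press_r gives the PRESS statistic; comparing it with the ordinary SSE indicates how much the fit degrades when each point has to be predicted rather than fitted.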
I can try to explain to you what I happen to be doing for model selection in regression, using SAS. You can use another software package, of course. This may be an imperfect methodology, but it works well for me.
1. Check for outliers and high influence data values.
proc reg;
model y=x1-x10/influence;
run;
I use the covariance ratio COVRATIO, R-Student, hat diagonals, etc. to assist me in identifying data values that can be removed.
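Common rule-of-thumb cutoffs (following Belsley, Kuh, and Welsch) for a model with p parameters and n observations are roughly
h_ii > 2p/n, |RSTUDENT| > 2, |DFFITS| > 2*sqrt(p/n), |COVRATIO - 1| > 3p/n.
I treat these as screening guides only; a flagged observation is examined, not automatically deleted.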
2. I check for multicollinearity.
proc reg;
model y=x1-x10/vif collin;
run;
Identify which variables should not be used in the same model. Use the results from steps 1 and 2 in the following steps.
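As a rough guide, the variance inflation factor for each predictor is
VIF_j = 1/(1 - R_j^2),
where R_j^2 comes from regressing x_j on the remaining predictors. VIF values above about 10 (some prefer 5) and condition indices above about 30 are the usual warning signs of serious multicollinearity.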
3. Model selection based on good fit first.
proc reg;
model y=x1-x10/selection=rsquare adjrsq mse cp;
run;
Mallows' Cp provides information on how to balance between underfitting (you get bias) and overfitting (you get large prediction variances).
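For reference, the usual definition is
C_p = SSE_p/MSE_full - (n - 2p),
where SSE_p is the error sum of squares of a candidate model with p parameters (including the intercept) and MSE_full is the mean squared error of the full model; a candidate with little bias has C_p close to p.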
I select a few candidate models that are best for fitting and that show promise to be good for prediction.
4. I obtain the PRESS statistic for each chosen model. Basically, it is a prediction error sum of squares that measures how stable the model is when predicting new data (the formula is given after the runs below).
proc reg; model y= x2 x4 x6/cli p; run;
proc reg; model y=x1 x2 x4 x5/cli p; run;
proc reg; model y=x1 x3 x4 x7/cli p; run;
proc reg; model y=x1-x10/cli p; run;
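For each model, PRESS is computed from the leave-one-out prediction errors:
PRESS = sum over i of (y_i - yhat_(i))^2,
where yhat_(i) is the prediction of the i-th observation from the model fitted without that observation. A PRESS value close to the SSE suggests the model will predict new data about as well as it fits the current data.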
5. Summarize all results in a summary table for the chosen models, plus the full model. You will be able to have more than one excellent model to choose from. I emphasize to students that each "x" is not just an x: there is a cost, there is measurement error, and there is its importance in the study, etc. Having alternate choices for models is very useful.
I recommend not using stepwise regression at all. Such model selection methods can be misleading, and they do not fully take multicollinearity into account. The F tests from stage to stage are not independent of each other.
The regression text by Montgomery, Peck, and Vining is excellent.
Douglas Montgomery and Geoff Vining were taught regression by Ray Myers at Virginia Tech, and so was I. The above steps were taught to us by Dr. Myers.
If you assess models simply using "fitness", it is reasonable to consider the residuals for regression models, linear or nonlinear. Then R^2 and SEE are the important statistics. In general, a nonlinear model is better than a linear one, because a linear model is only a special case of a nonlinear model. Besides R^2 and SEE, we suggested four other statistics for model assessment, i.e., TRE, ASE, MPE and MPSE. The expressions of the above six statistics can be found in the attached file. Hope it is helpful to you.
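For the first two, the usual definitions are
R^2 = 1 - SSE/SST and SEE = sqrt(SSE/(n - p)),
with p the number of fitted parameters; the expressions for the other four statistics are given in the attachment.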
Dr. Zeng, thank you for your reply. Would you mind sending me the references for the statistics you are suggesting? In the case I am working on, the models are linearized prior to regression, which allows me to use the statistics you are suggesting. Also, I was advised to use the AICc ranking for multimodel selection, which is based on information theory, not on regression statistics. Have you ever used such a criterion?
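As I understand it, for a least-squares fit with k estimated parameters and n observations,
AIC = n*ln(SSE/n) + 2k,
AICc = AIC + 2k(k + 1)/(n - k - 1),
and the candidate model with the smallest AICc is ranked best.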
The estimation of parameters for multivariate analysis or simultaneous equations can be based on an information criterion such as AIC, but for a specific model you have selected it is better to use other statistics, such as R^2, SEE, and MPE, to assess goodness of fit. You can use common statistical software such as SAS to estimate the parameters of an equation system, and then, based on the parameter estimates, the model assessment statistics can be calculated for each equation. One of my papers is attached for reference.
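As a rough sketch (assuming SAS/ETS is available, and not necessarily the exact procedure used in the attached paper), a two-equation system could be fitted jointly with seemingly unrelated regression and then assessed equation by equation; the variable names here are only placeholders:
proc syslin sur;
* fit both equations jointly so the cross-equation error correlation is used;
model y1=x1 x2;
model y2=x1 x3;
run;
The residuals from each equation then give the R^2, SEE, MPE, etc. for that equation.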