Dear Hamou, thank you for your answer. I am dealing with the same issue in the field of road crash prediction models. Using various variables you may fit several different models, and choosing the best among them is really not a simple task. You may use 'single-value' indicators (information criteria) such as AIC, BIC, DIC... or the overdispersion parameter (when using a negative binomial model). An alternative to a 'single-value' criterion is a 'continuous' indicator such as a cumulative residuals graph - check this source - https://www.researchgate.net/publication/239438865_Statistical_Road_Safety_Modeling (section 4).
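As a minimal, hedged sketch of this kind of comparison (assuming the statsmodels package; the variables aadt and lane_width, the simulated crash counts, and the model forms are purely illustrative, not taken from this thread):

```python
# Illustrative sketch: fit two negative binomial crash-frequency models
# and compare their AIC values. All data are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "aadt": rng.uniform(1_000, 20_000, n),        # hypothetical traffic volume
    "lane_width": rng.uniform(3.0, 3.8, n),       # hypothetical lane width (m)
})
mu = np.exp(-4 + 0.0001 * df["aadt"])             # assumed true mean structure
df["crashes"] = rng.negative_binomial(5, 5 / (5 + mu))

X1 = sm.add_constant(df[["aadt"]])
X2 = sm.add_constant(df[["aadt", "lane_width"]])

m1 = sm.GLM(df["crashes"], X1, family=sm.families.NegativeBinomial()).fit()
m2 = sm.GLM(df["crashes"], X2, family=sm.families.NegativeBinomial()).fit()

# Smaller AIC suggests the better trade-off between fit and complexity.
print("Model 1 AIC:", m1.aic)
print("Model 2 AIC:", m2.aic)
```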
If the measurement errors are normally distributed and the models are properly nested, then the F-test would be appropriate. The test is quite sensitive to violations of the normality assumption, so if that cannot be independently verified, other methods are more appropriate.
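A small sketch of such a nested-model F-test, assuming statsmodels and using synthetic data (the variables x1, x2 and the true coefficients are made up for illustration):

```python
# Compare a restricted model against the full model it is nested in
# with an F-test, assuming normal errors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2.0 + 1.5 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

restricted = smf.ols("y ~ x1", data=df).fit()      # smaller model
full = smf.ols("y ~ x1 + x2", data=df).fit()       # nested, larger model

# The F-test asks whether the extra term significantly reduces the residual SS.
print(anova_lm(restricted, full))
```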
The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) have already been mentioned. Information criteria in general work by adding a cost based on model complexity to the loss function. One common criticism of these methods is that the penalty for model complexity is arbitrary.
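For concreteness, here is a tiny sketch of the two penalties, computed directly from a model's log-likelihood, number of estimated parameters k, and sample size n (the numeric values below are made up):

```python
# AIC and BIC both start from -2*log-likelihood and add a complexity penalty;
# BIC's penalty also grows with the sample size n.
import numpy as np

def aic(loglik: float, k: int) -> float:
    return 2 * k - 2 * loglik

def bic(loglik: float, k: int, n: int) -> float:
    return k * np.log(n) - 2 * loglik

# Made-up example: a richer model fits better (higher loglik)
# but pays a larger complexity penalty.
print(aic(loglik=-120.0, k=3), aic(loglik=-118.5, k=5))
print(bic(loglik=-120.0, k=3, n=100), bic(loglik=-118.5, k=5, n=100))
```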
It might also be worthwhile to look at model selection using bootstrap methods. You can either compare the models by bootstrap resampling of the residuals directly or by refitting bootstrapped data sets.
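A rough sketch of the second option (refitting bootstrapped data sets), assuming statsmodels and synthetic data; the formulas and the choice of residual standard error as the comparison statistic are only illustrative:

```python
# Compare two candidate models by refitting them on case-resampled
# (bootstrap) data sets and inspecting the distribution of the difference
# in residual standard error.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 150
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] + 0.3 * df["x2"] ** 2 + rng.normal(size=n)

def resid_sd(formula, data):
    fit = smf.ols(formula, data=data).fit()
    return np.sqrt(fit.mse_resid)

diffs = []
for _ in range(500):
    boot = df.sample(n=n, replace=True)
    diffs.append(resid_sd("y ~ x1", boot) - resid_sd("y ~ x1 + I(x2**2)", boot))

diffs = np.asarray(diffs)
# If the interval stays above zero, the richer model consistently fits better.
print("95% bootstrap interval for the difference:",
      np.percentile(diffs, [2.5, 97.5]))
```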
You may also want to look into k-fold cross validation.
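For example, a minimal 5-fold cross-validation comparison with scikit-learn on synthetic data (model forms and data are illustrative assumptions):

```python
# The model with the smaller cross-validated error generalizes better
# on this data set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 1 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

for cols, label in [([0], "x1 only"), ([0, 1], "x1 + x2")]:
    scores = cross_val_score(LinearRegression(), X[:, cols], y,
                             cv=5, scoring="neg_mean_squared_error")
    print(label, "mean CV MSE:", -scores.mean())
```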
Are you trying to compare coefficients between two models, or are you trying to evaluate different models that may have entirely different parameters? Or something else?
I know that temperature and humidity are good predictors of Fusarium oxysporum growth in Montana corn fields and I want to compare that to a similar model for Citrus spot in Florida. Thus I have two models, and they both have the same form. However, something like AIC would be inappropriate for comparing these models.
Your options will be greatly reduced if the models are nonlinear. If the number of replicates is small (
Assuming that the data for y are the same for both equations, then you only have one equation, y = b1*x1 + b2*x2 + b3*x1*x2 + b4*x1*x2^2, and you can continue the polynomial as far as you desire (you can, which does not mean you should). The next terms might be b5*x1^2*x2 and b6*x1^2*x2^2. Look at the significance values for the regression coefficients and delete the non-significant terms.
At some point you should show your model to a statistician at your university. It is good to work this out for yourself, but there are a large number of details that you need to take care of. Are x1 and x2 correlated? Are the residuals well behaved? Have you plotted the data to see if the model fits?
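As a hedged illustration of the advice above, one could fit the polynomial/interaction model with statsmodels' formula interface and inspect the coefficient p-values before dropping terms (the data and the exact terms below are synthetic placeholders):

```python
# Fit a model with interaction and higher-order terms and look at the
# p-values of the coefficients before deciding which terms to drop.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 120
df = pd.DataFrame({"x1": rng.uniform(0, 1, n), "x2": rng.uniform(0, 1, n)})
df["y"] = (1 + 2 * df["x1"] + 3 * df["x2"]
           + 4 * df["x1"] * df["x2"] + rng.normal(scale=0.5, size=n))

model = smf.ols("y ~ x1 + x2 + x1:x2 + I(x1 * x2**2)", data=df).fit()
print(model.summary())   # inspect the p-value column
print(model.pvalues)     # non-significant higher-order terms are candidates to drop
```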
Simple methods for eliminating terms in a regression model include building the model using forward, backwards, or stepwise selection. These can be based on a p-value, or r-squared. One could also use the Akaike Information criterion (AIC), or any of several other statistics of this type. I might also look at residuals, focusing on the standard deviation of the residuals. You could also look at power, or look at error rates where you eliminate one value from the data, calculate the model and then see how well the model predicts the outcome for the removed value. The better model has smaller errors.
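A minimal sketch of one of these options, forward selection driven by AIC, assuming statsmodels and synthetic data (the stopping rule and candidate set are illustrative choices):

```python
# Forward selection: starting from the intercept-only model, add at each
# step the predictor that lowers AIC most; stop when no addition improves it.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - 1.5 * df["x3"] + rng.normal(size=n)

remaining, selected = ["x1", "x2", "x3", "x4"], []
best_aic = smf.ols("y ~ 1", data=df).fit().aic
while remaining:
    trials = [(smf.ols("y ~ " + " + ".join(selected + [c]), data=df).fit().aic, c)
              for c in remaining]
    aic, cand = min(trials)
    if aic >= best_aic:
        break
    best_aic, selected = aic, selected + [cand]
    remaining.remove(cand)

print("Selected predictors:", selected, "AIC:", best_aic)
```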
I took seven graduate courses on linear models and regression analysis. It is a vast field, with many issues to be considered. I also recommend taking your research problem to a statistician.
For your example submitted two days ago, the answer is:
If x1 to x4 are all independent, measured without error, the relationships between Y and x1 to x4 are linear, and the errors are independent and normally distributed, then a1 = b1.
If x1 to x4 are correlated, the two coefficients differ:
a1 is the slope obtained by regressing the residuals of Y on x2 (as the outcome) against the residuals of x1 on x2 (as the predictor),
i.e. Resid(Y | x2) = a1 * Resid(x1 | x2).
b1 is the slope obtained by regressing the residuals of Y on x2, x3 and x4 against the residuals of x1 on x2, x3 and x4,
i.e. Resid(Y | x2, x3, x4) = b1 * Resid(x1 | x2, x3, x4).
In other words, a1 and b1 each measure the association between the "unexplained information" in Y and the "unexplained information" in x1, after adjusting for different sets of predictors, so their values depend on how strongly x1 is correlated with the other independent variables.
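A small numerical check of this residual-on-residual argument (the Frisch-Waugh idea), assuming statsmodels and using synthetic, deliberately correlated predictors:

```python
# The coefficient b1 of x1 in the full model equals the slope from regressing
# the residuals of Y on x2..x4 against the residuals of x1 on x2..x4.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 300
z = rng.normal(size=n)
df = pd.DataFrame({
    "x2": z + rng.normal(scale=0.5, size=n),
    "x3": z + rng.normal(scale=0.5, size=n),
    "x4": rng.normal(size=n),
})
df["x1"] = 0.8 * df["x2"] + rng.normal(scale=0.5, size=n)   # x1 correlated with x2
df["y"] = 1 + 2 * df["x1"] + df["x2"] - df["x3"] + rng.normal(size=n)

full = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()

resid_y = smf.ols("y ~ x2 + x3 + x4", data=df).fit().resid
resid_x1 = smf.ols("x1 ~ x2 + x3 + x4", data=df).fit().resid
fwl = smf.ols("ry ~ rx1", data=pd.DataFrame({"ry": resid_y, "rx1": resid_x1})).fit()

print("b1 from full model:      ", full.params["x1"])
print("slope from residual fit: ", fwl.params["rx1"])   # the two agree
```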
For multiple linear regression, this is master's-level material. The outline for you is model selection and diagnostics; the main points are:
The choice of model depends on the hypothesis to be tested and on the data structure. As you mentioned, your outcome (dependent) variable should be continuous. You do not check the distribution of the dependent variable itself; you check it within the regression process. As many answers have noted, it is the residuals that should be independent and identically normally distributed, not the outcome itself. You also need to check the collinearity of the independent factors, which are assumed to be independent and measured without error, and the relationship between the outcome and the independent factors, which is assumed to be linear. The whole topic is a model-building and diagnosis problem, and there are many ways to do it; based on your statement, I think the following will be easy. Assume the dependent variable is continuous and normally distributed, as you stated:
1. Check the type of each cofactor; if some have many missing values or typos, correct them or leave them out. Then, for continuous factors, check the Pearson correlation coefficients: if the Pearson correlation between any two is near 1 or -1, one of them should be dropped from the multiple regression. For categorical factors, check the independence of any two; if most counts fall on the diagonal of the contingency table, one of the two categorical variables should be dropped.
2. Draw scatter plots of the continuous outcome against each quantitative independent factor to see the association: if a linear trend is shown, the factor stays in; if a non-linear effect is shown, a transformation is needed; if there is no trend (it looks random), the independent factor can be left out.
3. The observations of the dependent variable should be independent. If the dependent variable is related to time, check for autocorrelation; if autocorrelation exists, a time-series model, say Autoreg, might be used.
4. Do a PCA to see whether there is still multicollinearity among the independent factors; if some eigenvalue is near zero, you may drop one of them or define a new factor (a transformation).
5. If the sample size is large enough, say at least 10 times the number of unknown parameters, you can do the multiple regression; you may use an automatic selection option, such as forward, backward, or best-subset selection, which will select the independent factors for you.
6. Number of parameters: the intercept counts as 1, each continuous factor counts as 1, and a categorical factor with k levels counts as k-1.
7. Check for outliers using leverage, Cook's D, or the residuals (see the sketch after this list); if any exist, you may delete them or fit the model both with and without the outliers.
8. Check the normality of the residuals from the multivariable regression. If it is violated, use a transformation: if variance homogeneity holds, transform some independent factor; if variance homogeneity does not hold, transform the dependent variable. This may improve the model fit.
9. You may use Akaike's information criterion, the Bayesian information criterion, or Mallows' Cp to decide how many factors should be included. Using them is better than comparing R².
10. Check for interactions among the independent factors. An interaction between two quantitative predictors means there is a joint effect: the effect of one factor varies across the levels of the other. An interaction between a quantitative factor and a categorical factor, say gender, means that the effect (slope) of the quantitative factor differs between males and females.
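As a hedged sketch for the leverage / Cook's distance check in step 7, assuming statsmodels and synthetic data with one planted influential point (the 4/n cutoff is just a common rule of thumb):

```python
# Flag influential observations with Cook's distance and report their leverage.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] + df["x2"] + rng.normal(size=n)
df.loc[0, ["x1", "y"]] = [6.0, -20.0]           # plant one influential point

fit = smf.ols("y ~ x1 + x2", data=df).fit()
infl = fit.get_influence()

cooks_d = infl.cooks_distance[0]
leverage = infl.hat_matrix_diag
flagged = np.where(cooks_d > 4 / n)[0]          # common rule of thumb
print("Observations with large Cook's distance:", flagged)
print("Their leverages:", leverage[flagged])
```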
As I mentioned, if all of x1-x4 are independent, the two are similar. Here is an example showing that a1 and b1 can be totally different (a1 = 0.34, b1 = 7.9) due to collinearity among x1-x4; see the following.
When collinearity exists, a small change such as removing or adding a predictor or deleting a few records can make the estimated coefficients change greatly in magnitude or even change sign, and the estimated standard errors can change greatly as well. Table 1 shows Hald's data with a new response variable generated by Hadi and Liang (Chatterjee and Hadi 2012, Chapter 6). The estimated regression coefficients on x1, x4, x1-x3, and x1-x4 are listed in Table 2.
There is no attachment option here; find a1 in Model 1 and b1 in Model 3, both in the row for factor x1.
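Since the table cannot be attached, here is a hedged, synthetic stand-in (not Hald's data) that reproduces the same phenomenon: with nearly collinear predictors, adding one variable changes the coefficient of x1 and its standard error drastically:

```python
# Demonstrate coefficient instability under near-collinearity.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # nearly collinear with x1
df = pd.DataFrame({"x1": x1, "x2": x2})
df["y"] = 3 * df["x1"] + rng.normal(scale=0.5, size=n)

m_small = smf.ols("y ~ x1", data=df).fit()
m_large = smf.ols("y ~ x1 + x2", data=df).fit()

print("coef of x1, x1-only model:", m_small.params["x1"])
print("coef of x1, with x2 added:", m_large.params["x1"])   # can change size or sign
print("std errors:", m_small.bse["x1"], "vs", m_large.bse["x1"])
```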
In my view, the key issue is the research question: are you interested in (1) saying something about x1 and x2, or (2) simply in good prediction of y (housing prices)?
For (1), you claim that the effects of x1 and x2 are additive and linear if you use the model x1+x2. This may not be true, and checking interactions and non-linearity makes sense, especially if the sample size is decent (opinions will vary here on what is decent, but say n > 100). For (2), adherence to the model assumptions implies that you make better predictions, but I would be a bit more lenient, and the simple x1+x2 model may do a reasonable job. See Frank Harrell's book on regression modeling strategies (Springer, 2001).
First, I need to better understand what you mean by "multivariate regression". Do you mean the case where there are several correlated response variables (y1, y2, ..., yk), or do you mean the univariate case where there is one response variable and several regressor variables (x1, x2, ....., xk)?
Most likely, you meant the second case.
In order to be able to recommend useful criteria for model selection, it is important to know the main goal for the regression here. Is it "prediction" or is it "fitting"?
If prediction is most important, then you need to focus on obtaining a model that shows stability. Here, a well-known criterion is the PRESS, or prediction error sum of squares. It is an error sum of squares built from the deleted (leave-one-out) prediction residuals, with emphasis on stability for new data values used for prediction. Choose the model having the smaller PRESS.
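A minimal sketch of computing PRESS from the ordinary residuals and the hat-matrix diagonal, PRESS = sum((e_i / (1 - h_ii))^2), assuming statsmodels and synthetic data:

```python
# Compare two models by their PRESS statistic; the smaller value is
# preferred for prediction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 120
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

def press(fit):
    h = fit.get_influence().hat_matrix_diag
    return float(np.sum((fit.resid / (1 - h)) ** 2))

m1 = smf.ols("y ~ x1", data=df).fit()
m2 = smf.ols("y ~ x1 + x2", data=df).fit()
print("PRESS, x1 only :", press(m1))
print("PRESS, x1 + x2 :", press(m2))
```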
Check for the presence of multicollinearity. Obtain the variance inflation factor (VIF) for each regression coefficient Beta1, Beta2, ..., Betak.
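A short sketch of the VIF computation with statsmodels on synthetic data (the rough "VIF above ~10" warning level is a common convention, not a hard rule):

```python
# Variance inflation factors for each regressor; large values signal collinearity.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
n = 200
x1 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.1, size=n),   # nearly collinear
                  "x3": rng.normal(size=n)})
X_const = sm.add_constant(X)

for i, name in enumerate(X_const.columns):
    if name != "const":
        print(name, "VIF:", variance_inflation_factor(X_const.values, i))
```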
Check for influential points and outliers.
Regression analysis does not need normality of the data, but the inferential procedures in regression require normal errors. Modern statistical software packages can test for this. Use Q-Q plots or the basic Shapiro-Wilk test.
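A small sketch of both checks applied to the residuals of a fitted model (not to y itself), assuming statsmodels, scipy, and matplotlib, with synthetic data:

```python
# Q-Q plot and Shapiro-Wilk test on the residuals of a fitted regression.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
n = 150
df = pd.DataFrame({"x1": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=n)

fit = smf.ols("y ~ x1", data=df).fit()

w, p = stats.shapiro(fit.resid)                  # H0: residuals are normal
print("Shapiro-Wilk W =", w, "p-value =", p)

sm.qqplot(fit.resid, line="45", fit=True)        # points near the line => roughly normal
plt.show()
```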
Use only the adjusted R-squared and not the unadjusted coefficient of determination, and realize that this statistic is over-used.
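For reference, a minimal computation of the adjusted R-squared from the ordinary R-squared, the sample size n, and the number of predictors p (the numbers below are made up):

```python
# Adjusted R-squared penalizes the ordinary R-squared for model size.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.80, n=50, p=6))   # penalized relative to the raw 0.80
```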
There are other issues to check, but as a good start, after doing all the steps pointed out above, choose the better regression model.
If your question was about a comparison between two already chosen regression models, I would go back and verify if all conditions are met.
Explain in more detail what your goals are and what the regression is about. Describe the data too.
If you actually are considering multivariate regression models, you need to use some multivariate procedures. Follow texts such as the one by Johnson and Wichern.