Is multivariate regression more useful for analysis than for prediction?
Consider these two internet references:
Reference 1:
http://www.public.iastate.edu/~maitra/stat501/lectures/MultivariateRegression.pdf
Reference 2:
http://www.ats.ucla.edu/stat/sas/dae/mvreg.htm
Slides marked pages 533 to 539 in reference 1 seem a good summary of basic multivariate regression theory.
Before looking into multivariate regression, my inexperienced suspicion was that it would not be justified for small samples, since there would seem to be too little information to support the added model complexity. I still do not know whether that is a valid assessment, but near the end of reference 2, under "Things to Consider," I do see the following comment: "Multivariate regression analysis is not recommended for small samples."
From looking into this topic, including the two references here, it appears to me that multivariate regression is more useful for analysis than for the kind of prediction needed, say, to produce Official Statistics of the type with which I am familiar: periodic reporting on the energy industry. Judging by the examples given near the beginning of reference 2, it seems better suited to analyzing relationships than to prediction for published tables of estimated totals and their relative standard errors.
Further, because the trace noted on page 538 of reference 1 is to be minimized, it would seem to me that this might provide better predictions overall, but could be detrimental for a given dependent variable. If so, what if that variable represented, say, the most important question on a survey? For the continuous data collected in the finite population establishment surveys of the energy industry that I have worked with, I think this could be a problem, and I would guess it is likely to be a problem for many other applications as well.
However, although the trace noted in reference 1, on the slide numbered "538," seemed to explain the multivariate regression theory to me, I also saw, not far from the beginning of reference 2, under "Analysis methods you might consider," the following: "You could analyze these data using separate OLS regression analyses for each outcome variable. The individual coefficients, as well as their standard errors, will be the same as those produced by the multivariate regression. ..." Huh? If the trace noted in reference 1 is to be minimized, rather than the individual equations' sums of squared estimated residuals, then I don't see how this statement from reference 2 can be true. Wouldn't there be some tradeoffs involved? This is a second question. Though it may not be closely related to the question as stated above, I think that answering it is important for understanding multivariate regression, and therefore for answering the stated question.
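One way to probe the statement quoted from reference 2 is a small numerical check on simulated data. The sketch below is only that, not a proof: the sample size, the coefficients, and the error covariance are all made up for illustration, and every outcome shares the same design matrix `X` with no constraints linking the equations. It compares the coefficient matrix from a joint least squares fit with the coefficients from one OLS fit per outcome. It does not check the standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: observations, predictors, outcome variables.
n, p, q = 200, 3, 2

# Design matrix with an intercept column, shared by every outcome.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Made-up "true" coefficients, one column per outcome.
B_true = rng.normal(size=(p + 1, q))

# Errors that are correlated across the two outcomes.
E = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.8], [0.8, 2.0]],
                            size=n)
Y = X @ B_true + E

# Joint fit from the matrix normal equations, B = (X'X)^{-1} X'Y,
# which minimizes trace((Y - XB)'(Y - XB)).
B_joint = np.linalg.solve(X.T @ X, X.T @ Y)

# Separate fits: one ordinary least squares regression per outcome column.
B_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(q)]
)

# Largest absolute difference between the two sets of coefficients.
max_abs_diff = np.max(np.abs(B_joint - B_separate))
print(max_abs_diff)
```

If the two fits agree up to floating-point error, as this setup suggests they should, then minimizing the trace does not trade one outcome's fit against another's, at least when the regressors are identical across equations.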
Since we are to minimize the trace, whose diagonal elements are the sums of squared 'errors' for each regression, just as we would minimize those individual diagonal elements if looking at individual regressions, perhaps that is a reason this could be considered more analytical. If concentrating on prediction, the sum of squared 'errors' might not be as interesting as the estimated variance of the prediction errors for each dependent variable.
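The observation about the diagonal elements can be written out explicitly. The notation here is assumed for illustration and may not match the slides in reference 1: $Y$ is the $n \times q$ matrix of dependent variables with columns $y_j$, $X$ is the common design matrix, and $B$ is the coefficient matrix with columns $\beta_j$.

```latex
\operatorname{tr}\!\left[(Y - XB)^{\top}(Y - XB)\right]
  = \sum_{j=1}^{q} (y_j - X\beta_j)^{\top}(y_j - X\beta_j)
  = \sum_{j=1}^{q} \mathrm{SSE}_j(\beta_j)
```

Under these assumptions each term $\mathrm{SSE}_j$ depends only on its own column $\beta_j$, so the sum is smallest exactly when every term is smallest on its own. That would hold as long as all equations use the same $X$ and no restrictions tie the columns of $B$ together.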
(Note that OLS is considered here, but in general there is heteroscedasticity. So, whether we use multivariate multiple regression or a series of univariate multiple regressions, I think heteroscedasticity - see WLS - should be taken into account, to be more realistic.)
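To make the WLS remark concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: a single size measure `x`, an error variance proportional to `x`, and one common set of weights `w = 1 / x` applied to both outcomes. It fits WLS jointly by rescaling the rows, then again from the weighted normal equations one outcome at a time.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 150
x = rng.uniform(1.0, 10.0, size=n)      # hypothetical size measure
X = np.column_stack([np.ones(n), x])    # intercept plus one regressor

# Assumed variance structure: error standard deviation proportional to sqrt(x).
sigma = np.sqrt(x)
Y = np.column_stack([
    2.0 + 1.5 * x + sigma * rng.normal(size=n),
    -1.0 + 0.5 * x + sigma * rng.normal(size=n),
])

# WLS weights: the reciprocal of the assumed error variance.
w = 1.0 / x
sw = np.sqrt(w)[:, None]

# WLS as OLS on rows scaled by sqrt(weight); both outcome columns at once.
B_wls_joint, *_ = np.linalg.lstsq(sw * X, sw * Y, rcond=None)

# The same fit from the weighted normal equations, (X'WX)^{-1} X'W y_j,
# computed one outcome at a time.
XtW = X.T * w
B_wls_separate = np.column_stack(
    [np.linalg.solve(XtW @ X, XtW @ Y[:, j]) for j in range(Y.shape[1])]
)

print(B_wls_joint)
print(B_wls_separate)
```

The joint version relies on the weights being the same for every outcome. If different dependent variables called for different weights, the row rescaling would have to be done separately for each one.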
It appears to me that one should be cautious not to overparameterize by using multivariate analysis, if it is not required. (Perhaps this might be something of a corollary to Ockham's Razor.)
Given the references, my assumptions, and my knowledge of energy data, it appears that multivariate multiple regression is much better suited to biostatistical analyses than to prediction for energy data and other Official Statistics.
Thoughts? Comments?