Thanks for the reference, Frank T. Edelmann. So, based on the figure mentioned on that page (Fig. 8.14 in Ivesic et al.), it seems that, when cross-validating the data, a higher number of model parameters does not necessarily mean better prediction / lower error. Right? But is there really no relationship between the number of parameters and the error when doing cross-validation (it may depend on the data and the model, I guess)? Any references? (I also could not find the original "Ivesic et al." reference; do you know where it is?)
There is always a tradeoff between the bias and the variance of the estimate during model fitting. That means that with more model parameters we may reach a low-bias estimate, but this may increase the variance of the estimate when testing on a new dataset. Cross-validation methods such as K-fold or leave-one-out are great choices for finding the optimal bias/variance point in model fitting (less model complexity with less estimation error). You can find a great tutorial on applying cross-validation techniques at the link below:
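To make that concrete, here is a minimal sketch (my own illustration, not from the tutorial; the data and the candidate polynomial degrees are hypothetical) of using K-fold CV to compare models of increasing complexity with scikit-learn:

```python
# Minimal sketch: K-fold CV to compare models of increasing complexity.
# Data and degrees are hypothetical, for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # hypothetical noisy data
X = x.reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 2, 3, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # negative MSE averaged over the 5 held-out folds
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"degree {degree}: CV MSE = {-scores.mean():.3f} (+/- {scores.std():.3f})")
```

The degree with the lowest held-out MSE is the complexity the data actually support; adding parameters beyond that point typically increases the fold-to-fold variance rather than improving prediction.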
Yes, it does. You need to use out-of-sample MAE or MSE depending on how you optimized your predictions. Here's a framework for a forecast evaluation setup:
Chapter Forecast Evaluation Techniques for I4.0 Systems
The framework can be adjusted depending on your setup.
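As a small illustration of the out-of-sample idea (my own sketch with hypothetical data and a plain linear model, not the chapter's framework): hold out a test set and score it with MAE or MSE, depending on whether the predictions were optimized toward the median or the mean:

```python
# Sketch: out-of-sample MAE vs. MSE on a held-out test set (hypothetical data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# MAE is minimized by the median of the predictive distribution, MSE by the mean
print("out-of-sample MAE:", mean_absolute_error(y_test, pred))
print("out-of-sample MSE:", mean_squared_error(y_test, pred))
```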
Thanks for your great explanations and references. I checked them out. However, I am still not sure I can draw the conclusion I have in mind. Let me be more specific.
Assume that there are two models, one with 3 parameters (degrees of freedom; DOFs) and the other with 2. If we show that the first model (the one with 3 parameters) fits the data better and predicts the out-of-sample data more accurately, say using the inner loop of the nested cross-validation procedure mentioned by Ramtin Zargari Marandi, isn't it possible that the higher number of parameters of the first model has contributed to its higher accuracy? I guess my question is: "how (if at all) does cross-validation avoid/block the role of the models' DOFs?". The worst-case scenario would be that there is no relationship between the DOFs and out-of-sample predictive power; but is that the case (it does not seem so to me, though I might be wrong)? I do not want to write more, but can if necessary. Thanks
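To spell out what the nested procedure buys you, here is a sketch under assumed data, with the 2-DOF and 3-DOF candidates represented by polynomial degrees 1 and 2: the inner loop selects the candidate, and the outer folds score that choice on data the selection step never touched, so the extra parameter only "wins" if it actually improves held-out error:

```python
# Sketch of nested CV (hypothetical data): inner loop selects model complexity,
# outer loop estimates the error of the whole selection procedure.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=120)
y = 0.5 * x**2 - x + rng.normal(scale=0.4, size=x.size)
X = x.reshape(-1, 1)

pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
# degree 1 -> 2 free parameters, degree 2 -> 3 free parameters
param_grid = {"polynomialfeatures__degree": [1, 2]}

inner = KFold(n_splits=5, shuffle=True, random_state=2)
outer = KFold(n_splits=5, shuffle=True, random_state=3)

search = GridSearchCV(pipe, param_grid, cv=inner, scoring="neg_mean_squared_error")
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="neg_mean_squared_error")
print("nested-CV MSE: %.3f +/- %.3f" % (-outer_scores.mean(), outer_scores.std()))
```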
OK, great! So, Abed Khorasani, Frank T. Edelmann and Ramtin Zargari Marandi, based on what you mentioned, can we conclude that there is no predictable relationship between DOFs and prediction bias and variance?
Therefore, if somebody shows better performance (which is usually reported as error/bias) for the 3-DOF model vs. the 2-DOF model and they have done cross-validation, should we accept that claim? Or should they report both bias and variance (over validation folds), or undertake some other specific procedure?
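For what it is worth, here is a sketch (hypothetical data, where the true process is linear) of reporting both the mean and the spread of the fold-wise errors instead of a single error number when comparing the two models:

```python
# Sketch: report mean and SD of fold-wise errors for a 2-DOF vs. 3-DOF model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 0.8 * x + rng.normal(scale=0.3, size=x.size)   # truly a 2-DOF (linear) process
X = x.reshape(-1, 1)

cv = KFold(n_splits=10, shuffle=True, random_state=4)
for dof, degree in [(2, 1), (3, 2)]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"{dof}-DOF model: mean fold MSE {mse.mean():.3f}, SD over folds {mse.std():.3f}")
```

If the 3-DOF model's advantage in mean error is small compared with the fold-to-fold SD, the claim of superiority is weak.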
"model" is a generic term so you need to be more specific about the kind/type of model you want analyze. E.g. in the case of regression the only parameter is the number of independent variables. In geostatistics the "model" is a variogram or covariance function. In practice the software will only include a small number of model types (e.g. spherical, Gaussian,Exponential, ). For each of those the number of parameters is fixed
Cross validation might have different meanings depending on the application.
That is a good point, Donald Myers. By "model" I was thinking of mathematical equations ranging from simple low-order polynomials to higher-order fractions that may incorporate linear and non-linear terms/functions. Such rather simple and interpretable models are quite common in neuroscience, where the goal is to know what role each element of the model plays and what it represents. For some examples, see: Article The Normalization Model of Attention. I come across such models, which usually differ both in the number of parameters and in the model structure. Despite these differences, researchers often claim the superiority of one model over another, saying that they have done cross-validation (CV). But does CV prevent the complexity of a model from contributing to its better fit?
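As a rough illustration of that last question (my own sketch, hypothetical data): the in-sample error keeps falling as parameters are added, while the cross-validated error only improves as long as the extra terms capture structure that generalizes, which is exactly the penalty CV applies to complexity:

```python
# Sketch (hypothetical data): training error vs. cross-validated error as
# the number of model parameters grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=80)
y = np.tanh(2 * x) + rng.normal(scale=0.2, size=x.size)
X = x.reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=5)
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    fitted = model.fit(X, y)
    train_mse = mean_squared_error(y, fitted.predict(X))
    cv_mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: train MSE {train_mse:.3f}, CV MSE {cv_mse:.3f}")
```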