Dear Iman, If you are calculating the R2 based on the observed data (and the values predicted by the model), it represents the fraction of the total variation that is explained by your model: R2 = SSmodel/SStotal, where SS = sum of squares. The RMSE is the square root of the sum of squared residuals divided by n, i.e. the square root of the average squared residual; in other words, a measure of the variation not explained by the model.
For linear models, R2 = SSmodel/SStotal = 1 - SSresidual/SStotal and RMSE = sqrt(SSresidual/n), so the higher the R2, the lower the RMSE.
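As a quick numeric check of these identities, here is a minimal Python sketch with made-up data (the slope, intercept, and noise level are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical data: a noisy linear relationship
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Ordinary least-squares fit and predictions
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)   # SStotal
ss_resid = np.sum((y - y_hat) ** 2)      # SSresidual
r2 = 1 - ss_resid / ss_total             # R2 = 1 - SSresidual/SStotal
rmse = np.sqrt(ss_resid / y.size)        # RMSE = sqrt(SSresidual/n)

print(r2, rmse)  # for a fixed SStotal, a higher R2 forces a lower RMSE
```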
I may be missing something, but I cannot figure out how R2 and RMSE could increase simultaneously.
Are you willing to share additional details (for instance, the observed and predicted data) as well as the reported R2 and RMSE values?
As you know, MSE and RMSE depend only on the suitability of the model: when the data are predicted accurately, the sum of squared differences between measured and predicted values will be low.
R-square depends on the suitability of the model and also on the total sum of squares, i.e. the variability of the data. Thus, when analyzing two data sets with different variability, the R-square may differ even when the RMSE is equal, as the sketch below shows.
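A minimal Python sketch of this point, assuming two hypothetical data sets that share the same noise (so the RMSE against the true line is essentially identical) but differ in spread:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
noise = rng.normal(scale=1.0, size=x.size)

# Same noise in both sets, but very different total variability (SStotal)
y_flat  = 0.2 * x + noise   # little spread: small SStotal
y_steep = 5.0 * x + noise   # large spread: large SStotal

for y, y_hat in [(y_flat, 0.2 * x), (y_steep, 5.0 * x)]:
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(rmse, r2)  # RMSE is ~1 in both cases; R2 is much lower for the flat set
```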
Other factors are the number of variables and the number of data points used in each model. When comparing two models, these are not necessarily equal; each model has its own conditions and limitations for retaining and using variables.
Dear Ali, I guess we are assuming the same data set but different models. Could you please clarify the point you made regarding different parameters in the models? It is not clear to me how the RMSE or R2 calculation depends on the number of parameters/variables in the model. Thanks in advance, Luis
Actually, I used the same data set for both models. The only thing I just realized is that I used different types of cross-validation for the two models: for PLSR the cross-validation was full (leave-one-out), but for the ANN it was random (the MATLAB nftool default).
Could this difference between cross-validation schemes cause the confusion?
Dear Iman, Your first question was misleading, as you are not dealing with R2 but with Q2 (the predictive ability of your models). In that case, Q2 and RMSE do not necessarily change in opposite directions; that is, you may have a higher Q2 and a higher RMSE, or vice-versa.
Additionally, the validation procedure used is obviously a factor to take into account when comparing the suitability of models, but it may not be the actual reason for the differences (or the only one). Did you check the residuals?
Lastly, I should make clear that I do not fully agree with Ali's answer (if I understood it properly), as neither the R2 nor the RMSE reflects the suitability of the model. I therefore consider the statement "As you know MSE and RMSE only depends on the suitability of models" incorrect, because they also depend on the precision of the data. In fact, RMSE should only be used once the fitted model has been judged "valid" or suitable, i.e. shows no lack-of-fit.
R2 and Q2 do not depend exclusively on the model but also on the data variability (precision). A proper/suitable model may lead to a "low" R2 or Q2 if the data show low precision, and a statistically wrong/unsuitable model can lead to a high R2/Q2. Take a look, for instance, at the "Anscombe quartet" (google it). If needed I can send you some references and/or examples.
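A short Python sketch of the Anscombe point, using two of the quartet's published data sets: set I is plausibly linear, while set II is clearly a curve, yet a straight-line fit gives essentially the same slope, intercept, and R2 for both:

```python
import numpy as np

# Anscombe's sets I and II share the same x values
x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    # both sets give roughly slope 0.5, intercept 3.0, R2 0.67,
    # even though a straight line is clearly wrong for set II
    print(round(slope, 3), round(intercept, 2), round(r2, 3))
```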
The first question was correct, and I am dealing with R2 and RMSE.
There might be another possible answer: the data preparation done by default in nftool (ANN, MATLAB). I think there is a min-max normalization step in nftool that changes the scale of the data, so the different RMSE values might be due to the different scales of the data in PLSR and ANN, while the R2 can be the same for both.
Do you think this could be the reason?
By the way, I would appreciate it if you could send me the references. Always nice to learn more.
In that case, the normalization step you mentioned is probably the reason (or at least one of the reasons). The RMSE is scale dependent but the R2 is not: if you multiply all the data by 10 you still get the same R2, but the RMSE will be 10-fold larger.
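A minimal Python sketch of that "multiply by 10" point, with arbitrary made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 3.0 * x + rng.normal(scale=1.5, size=x.size)

def fit_stats(y):
    """Fit a straight line and return (R2, RMSE)."""
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return r2, rmse

r2, rmse = fit_stats(y)
r2_scaled, rmse_scaled = fit_stats(10 * y)
print(r2, r2_scaled)      # identical R2: it is scale invariant
print(rmse, rmse_scaled)  # the second RMSE is exactly 10 times the first
```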
Going back to your first question: if you are reporting the R2 (not Q2 or R2prediction), then my answer is the same as before. Using the same data, different models may (and most certainly will) give different R2 and RMSE values, and the models with larger R2 will show smaller RMSE (assuming no scaling was done).
However, you should be careful about how you interpret the estimated values. A larger or smaller R2 or RMSE will not tell you whether your model is suitable; unless you are merely describing what you already know, you will most certainly end up wrongly choosing an overfitted model as the "best one". Model validation (using an external set of data not used in building the model, using cross-validation, or any other approach) will give you an idea of how good your model's predictive ability is, and this is one of the critical aspects in multivariate analysis. I'll send you some references later on, as I'm out of office until next week.
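A minimal Python sketch of the overfitting trap described above, assuming made-up data from a simple linear truth and comparing a degree-1 fit against a deliberately over-flexible degree-10 polynomial:

```python
import numpy as np

rng = np.random.default_rng(3)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Hypothetical training and external (test) sets drawn from the same linear truth
x_train = np.linspace(0, 10, 15)
x_test  = np.linspace(0.3, 9.7, 15)
y_train = 2.0 * x_train + rng.normal(scale=2.0, size=x_train.size)
y_test  = 2.0 * x_test  + rng.normal(scale=2.0, size=x_test.size)

for degree in (1, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    print(degree,
          round(r2(y_train, np.polyval(coefs, x_train)), 3),  # training R2
          round(r2(y_test,  np.polyval(coefs, x_test)), 3))   # external-set R2 (~Q2)
# The flexible model typically wins on training R2 (it also fits the noise)
# but loses on the external set, which is exactly why validation matters.
```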
Yes, I think this is the source of the problem. As you mentioned, the R2 stays the same, so that is why the model with the higher R2 also has the higher RMSE: its RMSE is reported on a larger scale. (If reported on the same scale, the RMSE would be smaller for the model with the higher R2.)
I did use an external test set of 90 samples (the training set had 400 samples) for both models, to check how practical the models are.
Thanks for the information and the very nice discussion.