The same model gives an R2 of 0.94 on one test set (9 observations) but only 0.73 on another test set (95 observations); however, the 0.73 R2 is associated with lower RMSE and MAE. How can this situation be explained?
With so few observations (n = 9), there will be a lot of sampling error plus the potential for outliers/extreme observations to have a large influence on R squared. With a larger sample (e.g., n = 95), there is less sampling error and less potential for extreme cases to influence the results. I would trust the estimate of .94 (for n = 9) less than the estimate of .73 (which is based on a more substantial sample size).
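To see how unstable R2 is at n = 9, here is a minimal simulation sketch. The linear relationship, noise level, and "perfect-signal" predictions below are all invented purely for illustration; the point is only that repeatedly drawing test sets of 9 versus 95 observations from the same process produces a much wider spread of R2 values at n = 9.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical setup: the model is assumed to predict the noiseless signal
# exactly, so all variation in R2 comes from which observations happen to
# land in each test set.
def simulate_r2(n_test, n_reps=2000):
    scores = []
    for _ in range(n_reps):
        x = rng.uniform(0, 10, n_test)
        y_true = 2.0 * x + rng.normal(0, 3.0, n_test)  # observed values
        y_pred = 2.0 * x                               # model predictions
        scores.append(r2_score(y_true, y_pred))
    return np.array(scores)

for n in (9, 95):
    s = simulate_r2(n)
    print(f"n={n:3d}: R2 ranges roughly from {s.min():.2f} to {s.max():.2f}, "
          f"sd = {s.std():.2f}")
```

With this setup, R2 on 9-observation test sets swings across a wide range from draw to draw, while R2 on 95-observation test sets stays in a comparatively narrow band, which is exactly why the 0.94 figure deserves less trust.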
The model was trained on 520 observations (R2 = 0.85) and cross-validated on 173 observations (R2 = 0.86). However, when it is applied to the two test datasets, the one with nine observations shows an R2 of 0.94, while the one with 95 observations shows an R2 of 0.73 together with lower RMSE and MAE than the first test dataset, and also lower than the training and cross-validation datasets. How can this behaviour on the test datasets be explained?
The situation you described can occur when evaluating a predictive model on different test sets. It is possible for the model to have a higher R2 value on a smaller test set but a lower R2 value on a larger test set, even if the lower R2 value is associated with lower Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
R2 (coefficient of determination) represents the proportion of the variance in the dependent variable that is explained by the model. A higher R2 value indicates that the model captures a larger portion of the variability in the dependent variable.
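Concretely, R2 = 1 - SS_res / SS_tot, where SS_tot is computed from the mean of the observed values in the particular test set being evaluated. A minimal sketch of that calculation in plain NumPy (matching the single-output definition used by scikit-learn's r2_score):

```python
import numpy as np

def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is measured against the
    mean of the observed values in this particular test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot
```

The denominator is what makes R2 test-set dependent: the same prediction errors yield a different R2 depending on how much the dependent variable varies in that test set.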
In your case, the model performed well on the smaller test set (9 observations), resulting in a high R2 value of 0.94. This means that the model explains a large share of the variance in the dependent variable within that particular test set.
However, when the model was evaluated on the larger test set (95 observations), the R2 value dropped to 0.73. This indicates that the model's explanatory power decreased on the larger test set: it explains a noticeably smaller proportion of the variance in the dependent variable there.
Although the R2 value decreased, the lower R2 can still be associated with lower RMSE and MAE values. RMSE and MAE measure the average size of the model's prediction errors in the original units of the dependent variable, whereas R2 compares those errors to the total variance of the dependent variable within that particular test set. A lower RMSE and MAE mean that, on average, the model's predictions are closer to the actual values in the larger test set; but if the dependent variable varies less in that test set (its total sum of squares is smaller), the same or even smaller absolute errors make up a larger fraction of the variance, so R2 comes out lower.
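A toy numeric example makes this concrete. All numbers below are invented solely to mimic the pattern in your question, not taken from your data: the first "test set" is small with a wide-ranging target and fairly large errors, the second is larger with a narrow target range and smaller errors, yet the first gets the higher R2.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Small, high-variance "test set": large errors, but the target varies a lot.
y_true_a = np.array([2.0, 5.0, 9.0, 14.0, 20.0, 27.0, 35.0, 44.0, 54.0])
y_pred_a = y_true_a + np.array([3.0, -2.5, 3.0, -3.0, 2.5, -3.0, 3.0, -2.5, 3.0])

# Larger, low-variance "test set": small errors, but the target barely varies.
rng = np.random.default_rng(1)
y_true_b = rng.normal(10.0, 2.0, 95)
y_pred_b = y_true_b + rng.normal(0.0, 1.0, 95)

for name, yt, yp in [("small/wide target", y_true_a, y_pred_a),
                     ("large/narrow target", y_true_b, y_pred_b)]:
    rmse = mean_squared_error(yt, yp) ** 0.5
    print(f"{name:20s} R2={r2_score(yt, yp):.2f} "
          f"RMSE={rmse:.2f} MAE={mean_absolute_error(yt, yp):.2f}")
```

Here the small test set comes out with R2 around 0.97 despite RMSE and MAE near 2.8, while the larger test set lands around R2 of 0.75 with RMSE near 1.0 and MAE under 1.0, reproducing the "lower R2 but lower errors" pattern.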
This situation can also arise from inherent differences between the two test sets. The smaller test set might contain observations that are more predictable or more representative of the model's training data, resulting in higher explanatory power (higher R2). The larger test set, being more diverse, having a narrower spread in the dependent variable, or containing different patterns, could reduce the model's ability to explain the variance.
It's important to consider the limitations of using R2 alone as an evaluation metric. R2 does not capture the full picture of model performance, and other metrics such as RMSE and MAE provide additional insights into the accuracy of the model's predictions. Therefore, it's essential to examine multiple evaluation metrics and consider the specific characteristics of the test sets to interpret and explain the observed differences in performance.