Statistical models do not keep precipitation, evaporation, winds and temperature physically consistent with one another. As such, they cannot depict changes in atmospheric circulation under a future climate. Physical downscaling also has its problems, since it inherits the biases of the driving climatology. However, this can be handled by assessing anomalies rather than absolute values.
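To illustrate the anomaly idea, here is a minimal sketch in Python. All numbers are made up for illustration, and it assumes the model bias is a constant offset, which is the simplest case and not guaranteed in practice. The helper name `anomaly` is hypothetical.

```python
import numpy as np

def anomaly(series, baseline):
    """Return `series` minus the mean of the given baseline period."""
    return np.asarray(series, dtype=float) - np.mean(baseline)

# Made-up temperatures (deg C). The "model" carries a constant +2.0 warm bias.
obs_hist = np.array([14.0, 14.2, 13.9, 14.1])       # observed baseline
model_hist = obs_hist + 2.0                          # biased model, baseline period
model_future = np.array([17.5, 17.8, 17.6, 17.9])    # biased model, future period

# The change signal is computed inside the model, so a constant bias cancels.
delta = float(np.mean(anomaly(model_future, model_hist)))

# Adding the signal to the observed climatology gives a bias-adjusted value.
projected = float(np.mean(obs_hist)) + delta
```

A bias that changes with the climate state would not cancel this way, so the anomaly approach reduces the problem rather than removing it.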
Many thanks to those who answered this question. I would also have expected a physically based land surface model to easily outperform any simple statistical model. Unfortunately, land surface models seem to have some fundamental problems, so that they ALL under-perform. This was discovered by the PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER). A peer-reviewed paper should be released soon, but in the meantime more information can be found in this presentation: http://www.wenfo.org/ozewex/workshop/wp-content/uploads/2014/11/Abramowitz.pdf
The paper "The plumbing of land surface models: benchmarking model performance" is now available: http://journals.ametsoc.org/doi/abs/10.1175/JHM-D-14-0158.1
My favorite quotes from the paper:
"the indifferent performance of the models at sites with restricted soil moisture questions the current methods used in the LSMs for representing the stomatal control on transpiration"
"the LSMs do not appropriately use the information available in the atmospheric forcing data when estimating QH and QE"
The full abstract:
The PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER) was designed to be a land surface model (LSM) benchmarking intercomparison. Unlike the traditional methods of LSM evaluation or comparison, benchmarking uses a fundamentally different approach in that it sets expectations of performance in a range of metrics a priori – before model simulations are performed. This can lead to very different conclusions about LSM performance. For this study both simple physically based models and empirical relationships were used as the benchmarks. We performed simulations with 13 LSMs using atmospheric forcing for 20 sites and then examined model performance relative to these benchmarks. Results show that even for commonly used statistical metrics, the LSMs’ performance varies considerably when compared to the different benchmarks. All models outperform the simple physically-based benchmarks, but for sensible heat flux the LSMs are themselves outperformed by an out-of-sample linear regression against downward shortwave radiation. While moisture information is clearly central to latent heat flux prediction, the LSMs are still outperformed by a three variable non-linear regression that uses instantaneous atmospheric humidity and temperature in addition to downward shortwave radiation. These results highlight the limitations of the prevailing paradigm of LSM evaluation that simply compares a LSM to observations and to other LSMs without a mechanism to objectively quantify our expectations of performance. We conclude that our results challenge our conceptual view of energy partitioning at the land surface.
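For anyone wondering what an "out-of-sample linear regression against downward shortwave radiation" looks like in practice, here is a rough Python sketch. It is not the PLUMBER code: the data are synthetic, the site names, the function `out_of_sample_benchmark` and the stand-in "LSM" output are all hypothetical, and the numbers say nothing about any real model. The point is only that the regression is trained on other sites and then scored at a site it has never seen.

```python
import numpy as np

def rmse(pred, obs):
    """Root-mean-square error between two equal-length series."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(obs)) ** 2)))

def out_of_sample_benchmark(sw_by_site, qh_by_site, held_out):
    """Fit QH = slope * SWdown + intercept on every site except `held_out`,
    then predict QH at the held-out site."""
    train = [s for s in sw_by_site if s != held_out]
    x = np.concatenate([sw_by_site[s] for s in train])
    y = np.concatenate([qh_by_site[s] for s in train])
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit: [slope, intercept]
    return slope * sw_by_site[held_out] + intercept, slope, intercept

# Synthetic forcing and fluxes (W m-2) for three made-up sites.
rng = np.random.default_rng(0)
sites = ["A", "B", "C"]
sw = {s: rng.uniform(0.0, 1000.0, 200) for s in sites}
qh = {s: 0.3 * sw[s] + rng.normal(0.0, 10.0, 200) for s in sites}

# Benchmark: trained on A and B, evaluated at C.
pred, slope, intercept = out_of_sample_benchmark(sw, qh, held_out="C")
benchmark_rmse = rmse(pred, qh["C"])

# Stand-in for an LSM's QH at site C (observations plus larger noise).
lsm_qh = qh["C"] + rng.normal(0.0, 25.0, 200)
lsm_rmse = rmse(lsm_qh, qh["C"])

print(f"benchmark RMSE: {benchmark_rmse:.1f}, stand-in LSM RMSE: {lsm_rmse:.1f}")
```

Because the benchmark is fixed before the model is looked at, its score sets the expectation that the model then has to beat, which is the a priori idea described in the abstract.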