Most hydrological experiment stations that I am familiar with measure rainfall and streamflow and use these to help estimate evapotranspiration (ET). ET is simply difficult to measure. There are a few models and approaches that can help estimate ET, but it has been some time since I have done water balance studies. As you probably know or can imagine, there are limits to the accuracy of measuring rainfall over a landscape and streamflow in a river. It seems it would be even more difficult to measure ET at the river, or at watershed or catchment scale. I suppose there could be circumstances, such as little or no streamflow, a losing stream, or karst geology, where deep seepage can affect water balance estimates. If you don’t have any long-term data, you might try looking at some of David Rosgen's geomorphic channel indicators of bankfull flow, coupled with HEC-RAS and a comparison with any stream gauging station data from the same physiographic and climatic area. Some may choose to calibrate a model using existing long-term measurements of rainfall, flow, etc. from within the physiographic area. If you want the hydrological model to target something specific, such as flood severity and frequency, collecting any historical records and looking for channel indicators may be helpful. Aerial photos through time may help evaluate land use and channel changes.
We tested the added value of gridded evaporation products in hydrological model calibration and found that they have good potential to improve model performance if they are used appropriately (in fact, we tested various calibration settings).
The following papers will tell you more about our work:
Article Potential of Satellite and Reanalysis Evaporation Datasets f...
Article Improving the Predictive Skill of a Distributed Hydrological...
In agreement with Hansen's response above, river gage data typically have much less associated uncertainty than estimates of actual areal ET that are independent of the water mass balance, even if actual areal ET is estimated using point measurements (e.g. eddy flux tower measurements). Therefore, if you have precipitation data for the area of interest and stream gage measurements at a gage receiving drainage from that area, these stream gage measurements are likely to be useful calibration points, at least at annual and longer time scales. However, at shorter time scales (seasonal, monthly, daily, sub-daily) there may be significant changes in watershed water storage, and your watershed water mass balance model must take such storage changes into account. For these shorter time scales, estimates (including measurement-based estimates) of actual ET may provide useful calibration points for some watersheds; I suspect that Moctar's citations (above) address this.
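To make the water mass balance argument concrete, here is a minimal sketch in Python. It assumes a closed watershed with no deep seepage or inter-basin transfers, so that ET is the residual ET = P - Q - dS. The function name and all numbers are hypothetical and only for illustration.

```python
def et_residual(precip_mm, runoff_mm, storage_change_mm=0.0):
    """Estimate ET as the water-balance residual ET = P - Q - dS.

    All terms are depths in mm over the same watershed area and the same
    period. Assumes no deep seepage or inter-basin transfer (an assumption
    of this sketch, not something the data can confirm by itself).
    """
    return precip_mm - runoff_mm - storage_change_mm


# Annual scale: storage change is often assumed to be close to zero.
annual_et = et_residual(precip_mm=1200.0, runoff_mm=450.0)

# Monthly scale: the same assumption can bias the residual badly, so a
# storage-change estimate (hypothetical here) has to be included.
monthly_et = et_residual(precip_mm=150.0, runoff_mm=40.0, storage_change_mm=60.0)

print(annual_et, monthly_et)
```

At the annual scale the residual is only as good as the rainfall and streamflow measurements; at shorter scales the uncertainty in the storage term usually dominates.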
I am quite in agreement with the opinions of James C. Trask and William F. Hansen. In my experience, calibrating a model using discharge observations is easier, although not necessarily optimal at all times. Typically, discharge measurements are direct observations and carry much less uncertainty. It is also worth pointing out that discharge is measured locally, and the whole purpose of your calibration is to tune your parameters so that the simulation matches the observed record at that same location, which is relatively easy to achieve.
Actual ET, on the other hand, is a very complex variable that is not easily measured directly but is estimated through various approaches. It is also quite variable over space, depending strongly on water availability, potential ET (which acts as the energy boundary condition), vegetation cover, and phenological properties.
However, there is a great advantage in using actual ET. Since this variable is highly variable in both time and space, using this information (if available and accurate enough) for calibration might help your model represent the spatial patterns in your watershed much better, which might lead to more accurate predictions. This is essentially the point raised by Moctar Dembélé. Also, interestingly, I have noted that models can sometimes achieve a good representation of surface runoff (using discharge for calibration) while misrepresenting actual ET and infiltration/recharge (under-estimating or over-estimating one of these components at the expense of the other). See the paper below for an example.
Article A dynamic land use/land cover input helps in picturing the S...
All in all, if both datasets are available and trustworthy, using them both might be your best bet.
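One simple way to use both datasets is to score each candidate parameter set against discharge and ET together. The sketch below is an assumption on my part, not the method of the papers cited above: it uses the Nash-Sutcliffe efficiency (NSE) as the metric and a hypothetical weight `w_q`, and it compares basin-average time series rather than spatial patterns. All names and numbers are illustrative.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the simulation
    is no better than always predicting the observed mean."""
    mean_obs = sum(obs) / len(obs)
    sq_err = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sq_dev = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sq_err / sq_dev


def combined_score(q_obs, q_sim, et_obs, et_sim, w_q=0.5):
    """Weighted sum of NSE on discharge and NSE on actual ET.

    w_q is a hypothetical weight on discharge; it could be raised when the
    ET product is considered less reliable than the gauge record.
    """
    return w_q * nse(q_obs, q_sim) + (1.0 - w_q) * nse(et_obs, et_sim)


# Illustrative data: discharge is reproduced perfectly, but simulated ET is
# flat at the observed mean, so the ET term contributes no skill.
q_obs, q_sim = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
et_obs, et_sim = [1.0, 2.0, 3.0], [2.0, 2.0, 2.0]

print(combined_score(q_obs, q_sim, et_obs, et_sim))
```

A parameter set that fits discharge well but misrepresents ET gets penalized here, which is the situation described above where runoff looks good while ET and recharge are off.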