An inter-rater measurement would be a good choice. Have several other translators independently sample and verify the work done. The level of agreement among them may also indicate the reliability of the translation.
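One common way to put a number on inter-rater agreement is Cohen's kappa, which corrects the raw agreement between two raters for the agreement expected by chance. The following is only a minimal sketch under assumptions of my own: there are exactly two raters, each assigns one categorical judgement per sampled segment, and the labels ("ok", "error") and the sample data are hypothetical. With more than two raters, a different statistic would be needed.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who judged the same segments."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must judge the same non-empty sample")
    n = len(ratings_a)

    # Observed agreement: share of segments given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's label proportions, summed over labels.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgements by two reviewers on eight sampled segments.
rater_a = ["ok", "ok", "error", "ok", "error", "ok", "error", "ok"]
rater_b = ["ok", "error", "error", "ok", "error", "ok", "ok", "ok"]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Here the two reviewers agree on six of eight segments, but because both label most segments "ok", a good part of that agreement is expected by chance, so kappa comes out well below the raw 75% agreement.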
It is reasonable to assume that testing the validity of a translator's interpretations is, in effect, an evaluation of the translator's competence. Depending on the translator's degree of experience, different focal points mark the quality of a given translation, ranging from literal word-level translation to complex reader-oriented transfer that influences various levels of interpretation. Accordingly, establishing criteria for evaluating a translator's interpretation of a text is a top priority. To this end, we should do several things. First, we should specify the criteria that characterize an acceptable interpretation. Second, we should identify the nature of the translation errors produced by the translator. Third, we should determine the relative impact of those errors on the message conveyed. In essence, quality must be assessed not only at the linguistic level but also at the pragmatic level. Fourth, we should base quality assessment on text-linguistic analysis in order to specify the exact nature of deviations. Finally, we should assess interpretations in terms of scenes and frames to provide a psycholinguistic interpretation.
You might want to look at the grading rubric used by the American Translators Association for grading the ATA's certification exam. It is available on the ATA website: atanet.org
When I taught legal translation in the University of Chicago Graham School's Translation Certificate program, that was the rubric that all our faculty members and graders were required to use. As mentioned above, inter-rater reliability is extremely important to ensure that the results are comparable.
Generally, there are two methods to assess the quality of interpretations. The first method is error analysis. Basically, you analyze an instance of interpretation and identify erroneous renditions (e.g., omission, deviation, misinterpretation) based on a certain typology of errors. This method is usually time-consuming and requires evaluators to categorize errors consistently. However, it has the potential to provide a detailed and nuanced understanding of an interpretation. The other method is to use rubric-based rating scales. Rubrics, or descriptors of typical performance profiles, are provided for each band of a scale. When assessing an interpretation, evaluators determine the best fit between the characteristics of that interpretation and the descriptors associated with a band. Compared with the first method, this method is more holistic and potentially time-saving. You may choose either method to suit your needs.
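To make the error-analysis method a little more concrete, here is a minimal sketch of how annotated errors could be tallied once evaluators have categorized them. The three categories come from the examples above; the severity weights, the annotation format (segment number, category) and the sample data are hypothetical assumptions of mine, not part of any published typology or rubric.

```python
from collections import Counter

# Hypothetical severity weights per error category (assumed for illustration only).
SEVERITY_WEIGHTS = {"omission": 2, "deviation": 1, "misinterpretation": 3}


def summarize_errors(annotations):
    """Count errors per category and compute a weighted penalty.

    `annotations` is a list of (segment_number, category) pairs
    produced by an evaluator.
    """
    counts = Counter(category for _, category in annotations)

    # Reject categories outside the agreed typology so evaluators stay consistent.
    unknown = set(counts) - set(SEVERITY_WEIGHTS)
    if unknown:
        raise ValueError(f"categories not in the typology: {sorted(unknown)}")

    penalty = sum(SEVERITY_WEIGHTS[cat] * k for cat, k in counts.items())
    return dict(counts), penalty


# Hypothetical annotations for one interpreted passage.
annotations = [
    (3, "omission"),
    (7, "deviation"),
    (7, "misinterpretation"),
    (12, "omission"),
]
counts, penalty = summarize_errors(annotations)
print(counts, penalty)
```

The per-category counts keep the detailed picture that error analysis is meant to give, while the single penalty figure is only a convenience for comparing passages. If several evaluators annotate the same passage, their category assignments can be compared to check consistency.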