I have an assessment vendor who claims that his AI scoring works almost as well as 'trained assessor' scoring on test takers' responses in a language assessment.
We have some open-ended questions and recordings of test takers' answers to those questions. These recordings are then scored both by trained 'Human Evaluators' (2-3 evaluators independently rating each recording) and by the machine.
The vendor says that a 40% mismatch rate between human and machine evaluation is acceptable (a mismatch being a difference of more than 1 point on the 6-point scale), 30% is good, 20% is very good, and 10% is excellent. For me, 10% is the maximum we can accept; beyond that it seems problematic.
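In case it helps frame the comparison, here is a minimal Python sketch (with hypothetical scores, purely for illustration) that computes the mismatch rate exactly as the vendor defines it, alongside quadratic weighted kappa, an agreement statistic commonly reported in the automated scoring literature. A useful sanity check is to run the same calculation between your human raters, so you can compare machine-human agreement against the agreement your trained evaluators reach with each other.

```python
from collections import Counter

def mismatch_rate(human, machine, threshold=1):
    """Fraction of responses where |human - machine| exceeds `threshold`
    (the vendor's 'mismatch' = difference of more than 1 point)."""
    return sum(abs(h - m) > threshold for h, m in zip(human, machine)) / len(human)

def quadratic_weighted_kappa(human, machine, min_score=1, max_score=6):
    """Quadratic weighted kappa between two raters on an ordinal scale."""
    n_cats = max_score - min_score + 1
    n = len(human)
    # Observed joint distribution of (human, machine) score pairs
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for h, m in zip(human, machine):
        obs[h - min_score][m - min_score] += 1
    # Expected distribution under independence, from the marginals
    h_marg = Counter(h - min_score for h in human)
    m_marg = Counter(m - min_score for m in machine)
    num, den = 0.0, 0.0
    for i in range(n_cats):
        for j in range(n_cats):
            w = (i - j) ** 2 / (n_cats - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * h_marg[i] * m_marg[j] / n
    return 1.0 - num / den if den else 1.0

# Hypothetical scores on the 6-point scale (illustration only)
human_scores   = [3, 4, 2, 5, 6, 3, 4, 2, 5, 1]
machine_scores = [3, 5, 2, 4, 4, 3, 5, 3, 5, 2]
print("mismatch rate:", mismatch_rate(human_scores, machine_scores))
print("QWK:", round(quadratic_weighted_kappa(human_scores, machine_scores), 3))
```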
I am looking for research references along these lines. Any help or reference is most welcome and will be highly appreciated.
Thanks in advance.