When it comes to testing speaking production, we use external examiners (three or more) and we follow a rubric. After earlier attempts with electronic devices, we decided to rely on the human capacity to judge learners' performances.
Here is an article on speaking production. You may also like to see my thesis on my profile, though it is in Spanish.
Hope it helps.
Best,
Laura
Attachment: Angelini et al., Student perceptions of gain in telematic simulation
Versant, built by Jared Bernstein, is (to my knowledge) the first industrial success of an automated second language oral proficiency testing system. A much more recent paper with Bernstein as the lead author gives good references to start with:
Validating automated speaking tests, (2010), Jared Bernstein, Alistair Van Moere and Jian Cheng. Language Testing. 27 (3) 355-377.
I'll also add
Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology, (2000), Catia Cucchiarini, Helmer Strik and Lou Boves. Journal of the Acoustical Society of America 107
and other papers by the authors.
And
The Evaluation of Second Language Fluency and Foreign Accent, (2011), Chen-Huei Wu. Ph.D. dissertation, University of Illinois at Urbana-Champaign.
I was just searching for other literature on CALL in the journal Language Learning and Technology (http://llt.msu.edu/) and saw some articles on testing. As I was not looking for publications on testing, I did not read through them, but you may want to use the journal's search function to find them.