There is a simple telephone-based dialog system built on Asterisk + UniMRCP + Sphinx. The problem is that the live system recognizes input very, very badly, even though offline recognition accuracy on the same task is around 95%.
The vocabulary is small (about 20 words), and the possible input is defined with a JSGF grammar. The acoustic models are trained on telephone speech recorded specifically for this task.
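For reference, a minimal sketch of what such a grammar might look like (the grammar name, rule names, and words here are hypothetical; the actual 20-word vocabulary is task-specific):

```
#JSGF V1.0;
grammar commands;

// Hypothetical example; the real vocabulary is task-specific.
public <command> = <action> [<object>];
<action> = call | cancel | repeat | help;
<object> = operator | menu;
```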
The models perform well when evaluated in batch mode. However, after plugging them into the live system, recognition errors are enormous.
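For context, the batch-mode evaluation looks roughly like this (a sketch assuming the PocketSphinx tools, since the post only says "Sphinx"; all paths and file names are placeholders):

```
# Decode a list of 8 kHz telephone recordings against the JSGF grammar
# and write hypotheses to a file for scoring. All paths are placeholders.
pocketsphinx_batch \
    -hmm model/telephone_hmm \
    -dict task.dic \
    -jsgf task.gram \
    -ctl utterances.fileids \
    -cepdir audio \
    -cepext .raw \
    -adcin yes \
    -samprate 8000 \
    -hyp batch.hyp
```

Scoring the resulting batch.hyp against reference transcripts is what yields the ~95% accuracy figure mentioned above.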
What might be the reason?