As pointed out, the most common measure is the so-called word error rate (WER). It is computed by comparing a reference transcription with the transcription output by the speech recognizer. From this comparison you can count the errors, which fall into three categories:
- Insertions I (a word appears in the ASR output that is not in the reference)
- Deletions D (a word in the reference is missing from the ASR output)
- Substitutions S (a reference word is replaced by a different word in the ASR output)
WER = (S + D + I) / N
where N is the number of words in the reference transcription.
The main issue in computing this score is the alignment required between the two word sequences. It can be obtained through dynamic programming, using the so-called Levenshtein distance. Fortunately, several tools for computing it are available online (e.g., the jiwer package on PyPI).
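In case you want to roll your own, here is a minimal Python sketch of that dynamic-programming alignment. The names edit_counts and wer are just illustrative, not a standard API:

    def edit_counts(ref, hyp):
        """Count (substitutions, deletions, insertions) in the best alignment."""
        # dp[i][j] = Levenshtein distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i                      # delete all of ref[:i]
        for j in range(len(hyp) + 1):
            dp[0][j] = j                      # insert all of hyp[:j]
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                if ref[i - 1] == hyp[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                       dp[i - 1][j],      # deletion
                                       dp[i][j - 1])      # insertion
        # Walk back through the table to classify each error.
        S = D = I = 0
        i, j = len(ref), len(hyp)
        while i > 0 or j > 0:
            if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
                i, j = i - 1, j - 1              # match, no error
            elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
                S += 1; i, j = i - 1, j - 1      # substitution
            elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
                D += 1; i -= 1                   # deletion
            else:
                I += 1; j -= 1                   # insertion
        return S, D, I

    def wer(ref, hyp):
        S, D, I = edit_counts(ref, hyp)
        return (S + D + I) / len(ref)

    ref = "the cat sat on the mat".split()
    hyp = "the cat sat mat".split()
    print(wer(ref, hyp))  # 2 deletions / 6 reference words = 0.333...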
Although Mirco is correct and indeed many papers use this measure, I just want to add one point of caution. The formula is normalized by the length of the reference (the prompt), so it is possible to get error rates greater than 100%: this happens whenever S + D + I > N, for example when the recognizer output contains many insertions and is longer than the prompt. Therefore, for performance comparison, say among algorithms or parameter changes, it is better to work directly with S, D, and I and not read too much into the WER given by this equation. Alternatively, you can devise different nonstandard normalizations to aid you in making decisions on your algorithmic enhancements.
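To make this concrete (a made-up toy case): with reference "a b c" (N = 3) and hypothesis "w x y z" sharing no words with it, the best alignment gives S = 3, D = 0, I = 1, so WER = (3 + 0 + 1) / 3 ≈ 133%.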
The conventional metric is WER (described above). It can indeed be greater than 100%, but if it is, your problems are more serious than deciding on a metric. It's usual to tune decoder parameters so that insertions and deletions balance; if they don't, you have a mis-tuned decoder.
For some languages, such as Mandarin, the metric is often CER -- Character Error Rate.
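CER is the same edit-distance computation, just over characters instead of words. A one-function sketch, reusing the illustrative edit_counts() above (whether to strip spaces first is a per-task convention, not a standard):

    def cer(ref_text, hyp_text):
        # Same Levenshtein alignment as wer(), but over characters.
        ref = list(ref_text.replace(" ", ""))
        hyp = list(hyp_text.replace(" ", ""))
        S, D, I = edit_counts(ref, hyp)
        return (S + D + I) / len(ref)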
Then there's Utterance Error Rate -- the fraction of utterances whose transcription is not completely correct. This matters in some applications; for example, your digit error rate may be 1%, but what if your application takes inputs of 16-digit strings (a credit card number)? Now the effective per-string error rate is ~15%, and the accuracy is not so good.
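The ~15% follows if you assume the per-digit errors are independent: the probability that at least one of the 16 digits is wrong is 1 - (1 - 0.01)^16 = 1 - 0.99^16 ≈ 0.149.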
If you're doing something like information retrieval, WERs up to 30-35% give performance equivalent to that of a correct text transcription.
Finally, if the application is a dialog system, WER can rise to ~35% but the system may still achieve an acceptably high completion rate (since it can fix errors through clarification, etc.).
Also bear in mind that for some applications, such as Key Word Search, WER is only tangentially informative. What you actually want to know is something like the lattice recall rate.