There are probably measures designed specifically for speech recognition on smartphones. However, I would start with the well-known measures and adapt them to your specific semantics. If Precision and Recall alone are not enough to assess your classification results, you could take a look at how my heuristics for assessing classification results work. They are a classes-similarity measure, a cost-based measure, and a classes-number measure, all described in my thesis on emotion recognition: Opinion Mining and Lexical Affect Sensing.
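To give a rough idea of what a cost-based measure can look like, here is a minimal Python sketch. It is a generic cost-weighted error rate, not the exact definition from the thesis; the function name `average_misclassification_cost`, the emotion labels, and the cost values are all made up for illustration.

```python
def average_misclassification_cost(y_true, y_pred, cost, default_cost=1.0):
    """Mean cost per prediction; a correct prediction costs 0.

    `cost` maps (true_label, predicted_label) pairs to a penalty.
    Confusions not listed in `cost` fall back to `default_cost`.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one prediction")

    total = 0.0
    for true_label, predicted_label in zip(y_true, y_pred):
        if true_label != predicted_label:
            total += cost.get((true_label, predicted_label), default_cost)
    return total / len(y_true)


# Hypothetical costs: confusing opposite emotions is penalised more
# than confusing an emotion with "neutral".
cost = {
    ("joy", "anger"): 1.0,
    ("anger", "joy"): 1.0,
    ("joy", "neutral"): 0.5,
    ("anger", "neutral"): 0.5,
}

y_true = ["joy", "anger", "neutral", "joy"]
y_pred = ["joy", "joy", "neutral", "neutral"]

score = average_misclassification_cost(y_true, y_pred, cost)
print(score)
```

Unlike plain accuracy, this kind of score lets you say that some confusions hurt more than others, which is where your own semantics would come in through the cost table.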