My answer might not be very accurate, but you could try using a combination of pitch, breathing patterns detectable from the emitted sound, and other attributes, basically the factors that affect speech recognition. I don't know how accurately this would predict the final outcome. One way to find out is supervised training, that is, training on many samples of labeled speech audio and then checking whether the approach gives a usable prediction.
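To make the idea a bit more concrete, here is a minimal sketch of the supervised part using only the pitch feature. Everything in it is an assumption for illustration: the clips are synthetic sine waves rather than real speech, the labels "low" and "high" are placeholders for whatever outcome you actually want to predict, and the helper names (`synth_tone`, `estimate_pitch`, `train_centroids`, `predict`) are made up. Pitch is estimated with a plain autocorrelation peak and the classifier is a simple nearest-centroid rule, which is only one of many possible choices.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed value for this sketch


def synth_tone(freq_hz, duration_s=0.05):
    """Hypothetical stand-in for a recorded clip: a pure sine wave."""
    n = int(duration_s * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def estimate_pitch(signal, fmin=60.0, fmax=400.0):
    """Rough pitch in Hz: the lag with the largest autocorrelation."""
    min_lag = int(SAMPLE_RATE / fmax)
    max_lag = int(SAMPLE_RATE / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE / best_lag


def train_centroids(labeled_clips):
    """Supervised step: average the pitch feature for each label."""
    sums, counts = {}, {}
    for signal, label in labeled_clips:
        sums[label] = sums.get(label, 0.0) + estimate_pitch(signal)
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, signal):
    """Return the label whose average pitch is closest to this clip's pitch."""
    pitch = estimate_pitch(signal)
    return min(centroids, key=lambda label: abs(centroids[label] - pitch))


# Placeholder labeled data; real use would load labeled speech recordings.
TRAINING_DATA = (
    [(synth_tone(f), "low") for f in (100.0, 110.0, 120.0)]
    + [(synth_tone(f), "high") for f in (200.0, 210.0, 220.0)]
)

if __name__ == "__main__":
    model = train_centroids(TRAINING_DATA)
    print(predict(model, synth_tone(105.0)))
    print(predict(model, synth_tone(215.0)))
```

With real recordings you would add more features next to pitch (for example something derived from breathing pauses) and hold back part of the labeled data to measure how accurate the predictions really are.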