I have two speech signals coming from two different people. I want to find out whether or not both people are saying the same phrase. Is there anything that I can directly measure between the two signals to know how similar they are?
I think you need to learn a bit more about speech recognition. A similar question was asked on Stack Overflow and people gave several recommendations there; you may find it useful to refer to the following page:
The process of speech recognition is very complex and depends on many factors, for example:
1) Comparing a female speaker to a male speaker: there is usually a marked difference in frequency between genders.
2) The frequency range of an individual: some individuals have a higher frequency range than others, depending on many factors in their sound-production apparatus.
3) Whether there are anomalies in the speech: some anomalies occur due to heavy drinking or forcing the vocal cords.
4) Phoneme-to-word mapping, which depends on regional differences.
5) Background noise, which causes the Lombard effect.
6) Speed of utterance, which increases coarticulation.
These are some of the factors that come immediately to mind; they make a direct signal comparison useless without further processing (and even then).
There are many confounding factors that complicate this process. Here are some examples. Suppose you have a recording of your own voice, made in a soundproof room, saying "OPEN THE DOOR", and you would like to use that recording as the reference against which other voice commands are compared in order to trigger an action, for example opening the door.
Now, if you say the same phrase in a noisy environment, the two recordings are no longer the same.
If you change rooms and record in a reverberant room, the two signals are no longer the same.
If you say the same sentence at a different speed (speech rate) than the reference, the two signals are no longer the same.
If you utter the same sentence with a different rhythm than the reference, again, the two signals are no longer the same.
If some or all of the above-mentioned factors occur at the same time, again, the two signals are no longer the same.
Now imagine that you want to compare your reference signal with another person's recording of the same sentence. Even if both recordings are made under similar environmental conditions (same room, same equipment) and with the same rhythm and rate, the two recordings are still not the same.
Age, gender, and health condition are further confounding factors that influence the signal.
Extracting the formants of the two signals and comparing them with some similarity measure could be a very simple and quick solution, but unfortunately it does not give good results: for example, the similarity score between two completely different sentences recorded in the same acoustic environment can be higher than that between two roughly similar sentences recorded in different environments, or than that of a second recording in which the speaker says the same words as the reference but in a different order. A rough sketch of this kind of direct feature comparison is given below.
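A minimal sketch of such a direct comparison, assuming Python with the librosa library and two placeholder files `reference.wav` and `test.wav`: instead of formant tracks it uses MFCCs (a compact cepstral representation) and aligns the two feature sequences with dynamic time warping, so that a difference in speech rate does not dominate the distance. This only illustrates the idea; it inherits all the weaknesses discussed above (speaker, channel, and environment differences still swamp the score).

```python
import librosa

# Placeholder file names; any two mono recordings will do.
x, sr = librosa.load("reference.wav", sr=16000)
y, _ = librosa.load("test.wav", sr=16000)

# Cepstral features (MFCCs), one column per analysis frame.
mfcc_x = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13)
mfcc_y = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Dynamic time warping aligns the two feature sequences so that a
# difference in speech rate does not dominate the distance.
D, wp = librosa.sequence.dtw(X=mfcc_x, Y=mfcc_y, metric="euclidean")

# Accumulated cost of the optimal path, normalised by its length,
# serves as a crude dissimilarity score (lower = more similar).
score = D[-1, -1] / len(wp)
print(f"normalised DTW distance: {score:.2f}")
```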
To deal with these factors and variabilities, you typically need an acoustic model (such as a hidden Markov model or a Gaussian mixture model) to capture the acoustic characteristics of the signals in some relevant feature space (such as the cepstral or time-frequency domain) and to relate segments of the signal to linguistic units, and you also need a language model to link those units together and recognize the sentence. All of these procedures fall under the field of speech recognition; a toy sketch of the acoustic-modelling part alone is given below.
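As a toy illustration of only the acoustic-modelling step (no phone-level HMM and no language model, so it is far weaker than a real recognizer), one could fit a Gaussian mixture model to the cepstral frames of a reference recording and score a candidate recording against it. The file names and the number of mixture components below are arbitrary assumptions; the sketch uses librosa and scikit-learn.

```python
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Load a recording and return its MFCC frames, one row per frame."""
    signal, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

# Placeholder file names: a reference utterance and a candidate to test.
ref = mfcc_frames("reference.wav")
cand = mfcc_frames("candidate.wav")

# Fit a small Gaussian mixture to the reference's cepstral frames.
# This only captures the overall acoustic characteristics, not the
# phone sequence, so it is much weaker than an HMM-based recognizer.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(ref)

# Average per-frame log-likelihood of the candidate under the model;
# higher means the candidate's acoustics resemble the reference.
print("average log-likelihood:", gmm.score(cand))
```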