Say I have two audio files, one with a specific word spoken by a native speaker, the other with the same word spoken by a learner. When I say word, I probably really mean a single phoneme.

Using the native speaker's file as a reference, I want to detect how the learner's file differs from it. Obviously, differences in gender, pitch, speed, and the like should be ignored.
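To show roughly what I have in mind for this direct comparison, here is a minimal sketch assuming MFCC features plus dynamic time warping (my guess at a reasonable way to factor out speaking speed and most pitch information); the librosa calls and file names are just placeholders for whatever tooling would actually be appropriate:

```python
import librosa

def mfcc_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    # MFCCs discard most pitch information, which should help ignore
    # the gender/pitch differences I don't care about.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

native = mfcc_features("native_jack.wav")    # placeholder file names
learner = mfcc_features("learner_jack.wav")

# DTW aligns the two utterances in time, so different speaking speeds
# are tolerated; the accumulated cost of the optimal alignment path is
# a rough overall dissimilarity score.
cost_matrix, warping_path = librosa.sequence.dtw(X=native, Y=learner, metric="cosine")
print("alignment cost:", cost_matrix[-1, -1])
```

But a single overall score like this would not tell me *which* phoneme was wrong, which is what I actually want.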

To make this clearer: the initial phonemes in Jack, chin, gin, etc. are often pronounced wrongly by non-native speakers. Could an algorithm detect this by comparing the two files directly? Or would I instead train a classifier on Jack as pronounced by, say, 10 native speakers, and then feed it the learner's Jack?

Is this possible at all? Would TensorFlow be a tool to try?
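If TensorFlow is the right direction, here is a minimal sketch of how I imagine the classifier variant, assuming I also have some clearly mispronounced recordings for a negative class (the file lists, feature choices, and tiny network are all placeholders, not a real design):

```python
import numpy as np
import librosa
import tensorflow as tf

def word_features(path, sr=16000, n_mfcc=13, n_frames=40):
    """Load a short recording and return a fixed-size MFCC matrix."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, t)
    # Pad or truncate along time so every example has the same shape.
    if mfcc.shape[1] < n_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, n_frames - mfcc.shape[1])))
    return mfcc[:, :n_frames].T                                    # (n_frames, n_mfcc)

# Hypothetical training data: native (label 1) vs. mispronounced (label 0).
native_files = ["native_jack_01.wav", "native_jack_02.wav"]       # ... 10 speakers
wrong_files  = ["learner_wrong_01.wav", "learner_wrong_02.wav"]

X = np.stack([word_features(f) for f in native_files + wrong_files])
y = np.array([1] * len(native_files) + [0] * len(wrong_files))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=X.shape[1:]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, verbose=0)

# Score the learner's recording: closer to 1 means "sounds native-like".
score = model.predict(word_features("learner_jack.wav")[np.newaxis])[0, 0]
print(f"native-likeness score: {score:.2f}")
```

I realize 10 speakers is very little data for this, so I'd be glad to hear whether this framing makes sense at all or whether there is a standard approach for pronunciation scoring.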

Thank you for your ideas.
