In terms of practically useful applications, do you know of any significant applications for a system that can recognize speech containing words from more than one language?
In the Indian context, people's speech is rarely monolingual. While speaking in an Indian language to their friends or family, they casually insert English words. Even when they speak in English, they occasionally insert Hindi or other local state-language words. It is a good research challenge to try to recognize this speech.
Dear sir, I remember the discussion with you about a language-neutral ASR system. My guess is that it is very difficult to recognize words without a language model. But for the purpose you suggest here, a multilingual ASR (not language-neutral), which has some mechanism for identifying the language at the supra-segmental level and then applies the ASR for that language to that particular time segment, could do such a thing. I have no good idea of what those supra-segmental features might be. A rough sketch of such a pipeline is given below.
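Here is a minimal sketch of the segment-wise routing idea described in this post: identify the language of each time segment, then pass that segment to the ASR for that language. All of the names here (Segment, identify_language, asr_models) are hypothetical placeholders standing in for real components, not any existing library API.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    samples: list  # raw audio samples for this segment

def transcribe_code_switched(
    segments: List[Segment],
    identify_language: Callable[[Segment], str],
    asr_models: Dict[str, Callable[[Segment], str]],
) -> List[str]:
    """Run language ID on each segment, then the matching monolingual ASR."""
    transcript = []
    for seg in segments:
        lang = identify_language(seg)       # supra-segmental language ID
        recognizer = asr_models[lang]       # pick the ASR for that language
        transcript.append(recognizer(seg))  # decode this segment only
    return transcript

A real system would also need to smooth the language decisions across neighboring segments, since a single short code-switched word gives the language identifier very little evidence to work with.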
Thanks for your reply. Research means taking on challenges that look unsolvable. I have about 10 years of service left, and I am going to put my brain, energy and focus into this work. We have recently published, in IEEE ASLP and JASA, methods to detect closure-burst transitions and epochs using simple time-domain features and classifiers that do not require training, and the results are comparable to or better than the best in the literature. We have shown that these techniques work on English, Kannada and Tamil. I have a strong feeling that most researchers in speech today have simply believed whatever some experts opined in the past, rather than trying to understand the signal deeply and coming up with better features inspired by knowledge of the signal, its production and its perception.
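To be clear, the following is not the published method from the papers mentioned in the post above; it is only a generic illustration of what a training-free, time-domain detector can look like. A closure shows up as a low-energy stretch, and the burst as an abrupt rise in short-time energy; the frame size and thresholds here are arbitrary placeholders.

import numpy as np

def short_time_energy(x: np.ndarray, frame: int = 80, hop: int = 40) -> np.ndarray:
    """Frame-wise energy of x (frame=80 samples is 10 ms at 8 kHz)."""
    n_frames = 1 + (len(x) - frame) // hop
    return np.array([np.sum(x[i * hop : i * hop + frame] ** 2)
                     for i in range(n_frames)])

def closure_burst_candidates(x, low_thresh=1e-4, rise_ratio=10.0, hop=40):
    """Flag sample indices where a quiet closure is followed by a sharp
    energy rise -- a crude, training-free burst-onset detector."""
    e = short_time_energy(x, hop=hop)
    candidates = []
    for i in range(1, len(e)):
        if e[i - 1] < low_thresh and e[i] > rise_ratio * max(e[i - 1], 1e-12):
            candidates.append(i * hop)
    return candidates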
I do agree that once phoneme recognition is complete, we need to bring in a language model to recognize the words; however, that model need not be limited to simply the n-gram probabilities of successive words.
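As one concrete (toy) instance of the n-gram case being discussed: a pronunciation lexicon maps recognized phoneme strings to candidate words, and bigram probabilities rank competing word sequences. Every lexicon entry and probability below is made up for illustration; the point of the post above is precisely that richer models than this are possible.

import math

# Toy pronunciation lexicon: phoneme string -> candidate words.
lexicon = {
    "dh ax": ["the"],
    "k ae t": ["cat"],
    "s ae t": ["sat"],
}

# Toy bigram log-probabilities; unseen pairs get a crude backoff score.
bigram_logprob = {
    ("<s>", "the"): math.log(0.5),
    ("the", "cat"): math.log(0.1),
    ("the", "sat"): math.log(0.001),
}

def score_word_sequence(words):
    """Sum bigram log-probabilities over a word sequence (backoff: -10)."""
    total, prev = 0.0, "<s>"
    for w in words:
        total += bigram_logprob.get((prev, w), -10.0)
        prev = w
    return total

print(score_word_sequence(["the", "cat"]))  # about -3.0: the likelier reading
print(score_word_sequence(["the", "sat"]))  # about -7.6: penalized by the LM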
Dear Sir, good evening. I have a shallower understanding of the signal than those trained in signal processing, but I have observed signal shapes closely while doing manual sound annotation. On that basis I can say that distinct phones can be identified relatively easily from the signal shape and spectrum. But I also have some experience where one word, listened to 50 times, is perceived in at least 4 different ways by the same listener (myself). There is confusion even about the number of phones. Such cases seem more difficult to me to teach to a machine.