I have samples of the phonemes of English. I am looking for the best method of concatenative synthesis, and also the best way to remove the glitches I hear when concatenating small units of speech (phonemes).
The way I did it was inspired by an algorithm I found in the book DAFX by Zölzer (2002). Basically, you calculate the cross-correlation between the end of the first phoneme and the beginning of the second and find the lag at which they correlate the most. This is where you want to join the phonemes/diphones/units. Around this point, you multiply the end of the first phoneme by a decreasing ramp and the beginning of the second phoneme by an increasing ramp, and then add the two together (a crossfade). I hope the idea is clear.
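To make the idea concrete, here is a minimal sketch in plain Python of how I read the steps above. It is not the DAFX code or my Matlab file: the function names `best_lag` and `crossfade_concat`, the parameters `overlap` and `max_lag`, the use of a normalized correlation score, and the linear ramps are all assumptions made for illustration.

```python
import math

def best_lag(tail, head, max_lag):
    """Return the offset into `head` whose next len(tail) samples
    correlate best (normalized cross-correlation) with `tail`."""
    n = len(tail)
    tail_energy = sum(x * x for x in tail)
    best, best_score = 0, -math.inf
    for lag in range(min(max_lag, len(head) - n) + 1):
        seg = head[lag:lag + n]
        seg_energy = sum(x * x for x in seg)
        denom = math.sqrt(tail_energy * seg_energy)
        # Silent segments get a neutral score instead of dividing by zero.
        score = sum(a * b for a, b in zip(tail, seg)) / denom if denom > 0 else 0.0
        if score > best_score:
            best, best_score = lag, score
    return best

def crossfade_concat(first, second, overlap, max_lag):
    """Join two units (lists of samples) with a linear crossfade of
    `overlap` samples, aligned at the best-correlating lag."""
    if not 1 <= overlap <= min(len(first), len(second)):
        raise ValueError("overlap must fit inside both units")
    tail = first[-overlap:]
    lag = best_lag(tail, second, max_lag)
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # increasing ramp for the second unit
        mixed.append((1.0 - w) * tail[i] + w * second[lag + i])
    # Samples of `second` before `lag` are discarded.
    return first[:-overlap] + mixed + second[lag + overlap:]
```

In practice `overlap` would probably cover one or two pitch periods, and `max_lag` about one pitch period, so that the search can line up the waveforms in phase. Those values depend on the voice and the sampling rate, so treat them as things to tune.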
I understood your idea, but when I concatenate the phonemes of a word, it doesn't sound like a word being pronounced; rather, it sounds like the phonemes being read one after the other. I thought that if I could play the signal faster it would sound closer to the natural word, so I tried increasing the sampling rate during playback, but the quality degrades. How should I proceed with this concatenation approach?
Did you implement the method I suggested? In principle, if you use it and the phonemes are taken from relatively similar contexts, it should work okay. Here are some sentences I produced by concatenating diphones exactly as I suggested:
http://voice-research.org/synth/synth.php
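For a faster speaking rate, it would be better to use shorter units (phonemes, in your case) or to use the PSOLA algorithm, which I think you are familiar with. PSOLA changes the duration without changing the pitch, whereas raising the playback sampling rate shifts the pitch and formants along with the speed.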
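As a rough illustration of changing the duration without touching the sampling rate, here is a sketch of plain overlap-add (OLA) time scaling in Python. This is not PSOLA: PSOLA places its windows on pitch marks, while this simplified version uses fixed frame positions and can sound rough on voiced speech. The function name `ola_time_scale` and the parameters `rate` and `frame_len` are assumptions made for this example.

```python
import math

def ola_time_scale(signal, rate, frame_len=256):
    """Speed up (rate > 1) or slow down (rate < 1) a list of samples by
    reading windowed frames at one hop size and overlap-adding them at another."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    hop_out = frame_len // 2                  # spacing of frames in the output
    hop_in = max(1, round(hop_out * rate))    # spacing of frames in the input
    # Hann-like window, offset by half a sample so it is never exactly zero.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 0.5) / frame_len)
              for i in range(frame_len)]
    n_frames = max(1, (len(signal) - frame_len) // hop_in + 1)
    out_len = (n_frames - 1) * hop_out + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for f in range(n_frames):
        src, dst = f * hop_in, f * hop_out
        for i in range(frame_len):
            if src + i < len(signal):
                out[dst + i] += window[i] * signal[src + i]
                norm[dst + i] += window[i]
    # Divide by the summed window so overlapping frames do not change the level.
    return [o / n if n > 1e-12 else 0.0 for o, n in zip(out, norm)]
```

With `rate=2.0` the output is roughly half as long as the input, and each frame keeps its original waveform shape, so the pitch inside a frame is unchanged. The frame boundaries are not pitch-synchronous, though, and that is exactly the problem PSOLA is designed to solve.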
(I've sent you the Matlab file that I use to concatenate diphones; you can read it to see exactly how it works.)