03 February 2022

I am trying to build a model that can produce speech for any given text.

I could not find any voice-cloning algorithm that can clone a voice from speech alone, so I turned to TTS (text-to-speech) models. I have the following doubts regarding data preparation:

The LJSpeech dataset consists of many 3-10 second recordings, and going by it we need around 20 hours of data. It will be very hard for me to produce that many 10-second recordings. What would be the impact if I instead made many 5-minute recordings? One impact could be a higher resource requirement (but how much higher?). Are there any others?

Also, is there some way to convert these 5-minute recordings into the LJSpeech format?
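
To make the conversion question concrete, here is a minimal sketch of one possible approach, not a definitive implementation: cut each long recording at silent pauses into clips of roughly 1-10 seconds, write them into a `wavs/` folder, and append one pipe-delimited line per clip to `metadata.csv`, which is the layout LJSpeech uses. The function names, the `REC001` prefix, and all thresholds (`silence_thresh`, `min_silence_ms`, and so on) are assumptions made for illustration and would need tuning per recording. The sketch assumes 16-bit mono WAV input on a little-endian machine, keeps the source sample rate (LJSpeech itself is 22050 Hz), and leaves the transcript fields empty, since the text for each clip still has to be filled in by hand or with a speech recognizer.

```python
import array
import os
import wave


def find_segments(samples, rate, frame_ms=20, silence_thresh=500,
                  min_silence_ms=300, min_len_s=1.0, max_len_s=10.0):
    """Return (start, end) sample indices of clips cut at silent pauses.

    All thresholds are illustrative guesses, not values from LJSpeech.
    """
    frame = max(1, rate * frame_ms // 1000)          # samples per frame
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    min_len, max_len = int(min_len_s * rate), int(max_len_s * rate)

    segments, start, silent_run = [], 0, 0
    for pos in range(0, len(samples), frame):
        chunk = samples[pos:pos + frame]
        end = pos + len(chunk)
        is_silent = max(abs(s) for s in chunk) < silence_thresh
        if is_silent and start == pos:
            start = end                              # skip leading silence
            continue
        silent_run = silent_run + 1 if is_silent else 0
        at_pause = silent_run >= min_silent_frames and end - start >= min_len
        too_long = end - start >= max_len            # hard cut, may split a word
        if at_pause or too_long:
            segments.append((start, end))
            start, silent_run = end, 0
    if len(samples) - start >= min_len:              # shorter leftovers are dropped
        segments.append((start, len(samples)))
    return segments


def to_ljspeech(wav_path, out_dir, prefix="REC001"):
    """Split one long WAV into out_dir/wavs/*.wav and append to metadata.csv."""
    with wave.open(wav_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("this sketch expects 16-bit mono WAV input")
        rate = src.getframerate()
        # Assumes a little-endian host, matching the WAV byte order.
        samples = array.array("h", src.readframes(src.getnframes()))

    os.makedirs(os.path.join(out_dir, "wavs"), exist_ok=True)
    clip_ids = []
    for i, (start, end) in enumerate(find_segments(samples, rate), 1):
        clip_id = f"{prefix}-{i:04d}"
        clip_path = os.path.join(out_dir, "wavs", clip_id + ".wav")
        with wave.open(clip_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(samples[start:end].tobytes())
        clip_ids.append(clip_id)

    # LJSpeech layout: id|transcription|normalized transcription.
    # Both text fields are left empty here and must be filled in afterwards.
    with open(os.path.join(out_dir, "metadata.csv"), "a", encoding="utf-8") as f:
        for clip_id in clip_ids:
            f.write(f"{clip_id}||\n")
    return clip_ids
```

Calling `to_ljspeech("session1.wav", "my_dataset", prefix="S01")` (both names hypothetical) would produce `my_dataset/wavs/S01-0001.wav`, `S01-0002.wav`, and so on, plus matching rows in `my_dataset/metadata.csv`. Cutting at pauses rather than at fixed intervals is a deliberate choice so that clips tend to end between phrases, which makes transcribing them easier.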
