I am trying to build a model that can produce speech for any given text.
I could not find any voice cloning algorithm that can clone a voice from speech alone, so I turned to TTS (text-to-speech) models. I have the following questions about data preparation:
Going by the LJSpeech dataset, which is made up of many short recordings (roughly 1-10 seconds each), we need around 20 hours of data. It will be very hard for me to produce that many 10-second recordings. What would be the impact if I made many 5-minute recordings instead? One impact could be a higher resource requirement (but how much higher?). Are there any others?
Also, is there some way to convert these 5-minute recordings into the LJSpeech format?
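To make the conversion question concrete, here is a minimal sketch of one possible approach: cut each long recording at pauses so that no clip exceeds about 10 seconds, then write the clips to a `wavs/` folder with a pipe-separated `metadata.csv`, as in LJSpeech. It assumes mono 16-bit PCM WAV input. The function names, the `REC001` prefix, the RMS threshold of 500, and the 300 ms minimum pause are all illustrative assumptions, not part of any TTS toolkit, and they would need tuning for a real recording. The transcripts are left as placeholders because they still have to be supplied by hand or by a speech recognizer.

```python
import math
import sys
import wave
from array import array
from pathlib import Path


def find_silence_cuts(samples, rate, frame_ms=20, threshold=500.0, min_silence_ms=300):
    """Return sample indices at the middle of each sufficiently long silent run."""
    frame = max(1, rate * frame_ms // 1000)
    min_frames = max(1, min_silence_ms // frame_ms)
    cuts, run_start = [], None
    n_frames = len(samples) // frame
    for i in range(n_frames + 1):
        if i < n_frames:
            chunk = samples[i * frame:(i + 1) * frame]
            rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
            silent = rms < threshold  # threshold is an assumed value
        else:
            silent = False  # sentinel frame that closes a trailing silent run
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            if i - run_start >= min_frames:
                cuts.append((run_start + i) // 2 * frame)
            run_start = None
    return cuts


def plan_segments(n_samples, rate, cuts, max_len_s=10.0):
    """Greedily choose cut points so that no segment is longer than max_len_s."""
    max_len = int(max_len_s * rate)
    segments, start = [], 0
    while n_samples - start > max_len:
        usable = [c for c in cuts if start < c <= start + max_len]
        end = usable[-1] if usable else start + max_len  # hard cut if no pause found
        segments.append((start, end))
        start = end
    if start < n_samples:
        segments.append((start, n_samples))
    return segments


def split_to_ljspeech(wav_path, out_dir, prefix="REC001"):
    """Split one long WAV into clips and write an LJSpeech-style metadata.csv."""
    with wave.open(str(wav_path), "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("this sketch expects mono 16-bit PCM audio")
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    samples = array("h", raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    out = Path(out_dir)
    (out / "wavs").mkdir(parents=True, exist_ok=True)
    segments = plan_segments(len(samples), rate, find_silence_cuts(samples, rate))
    rows = []
    for k, (a, b) in enumerate(segments, 1):
        clip_id = f"{prefix}-{k:04d}"
        with wave.open(str(out / "wavs" / f"{clip_id}.wav"), "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(rate)
            w.writeframes(raw[2 * a:2 * b])  # 2 bytes per 16-bit sample
        # LJSpeech rows are: id|transcription|normalized transcription
        rows.append(f"{clip_id}|TODO transcript|TODO transcript")
    (out / "metadata.csv").write_text("\n".join(rows) + "\n", encoding="utf-8")
    return segments
```

Calling `split_to_ljspeech("long_take.wav", "my_dataset")` (both names hypothetical) would produce `my_dataset/wavs/REC001-0001.wav` and so on. A plain energy threshold like this tends to struggle with background noise, so a proper voice activity detector or a forced aligner may be a better fit for real data.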