I have a project for which I have been given a large dataset of 10-20 second audio files of people singing "swar" (the notes used in ragas: "sa re ga ma pa"). The data has no labels or annotations of any kind. I have to create a deep learning model that recognises which swar is being sung and for how long it is present in the audio clip, i.e. the time range of each particular swar (sa, re, ga, ma).
The questions I am looking for answers to are:
1. How can I achieve my goal? Should I use an RNN, a CNN, an LSTM, a hidden Markov model, or something else, such as unsupervised learning for speech recognition?
2. How do I capture the correct speech tone for an Indian language, given that most acoustic speech recognition models are tuned for English?
3. How do I find the time range, i.e. the interval over which a particular swar is present in the music clip? How can that time-range detection be combined with the speech recognition model?
4. Are there any existing music recognition models that resemble my research topic? If yes, please point me to them.
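To make question 3 concrete, here is a rough sketch of the kind of output I have in mind. It is not a solution, only an illustration under several assumptions: a frame-level pitch track (in Hz, with `None` for unvoiced frames) is already available from some pitch tracker, the tonic (the frequency of "sa") is known, and only the seven shuddha swar are considered. The names `SWARA_OFFSETS`, `frame_to_swara` and `swara_segments` are hypothetical and made up for this example.

```python
import math

# Assumption: semitone offsets of the seven shuddha swar above the tonic ("sa").
SWARA_OFFSETS = {0: "sa", 2: "re", 4: "ga", 5: "ma", 7: "pa", 9: "dha", 11: "ni"}


def frame_to_swara(f0_hz, tonic_hz):
    """Map one pitch value (Hz) to the nearest swar name; None for unvoiced frames."""
    if f0_hz is None or f0_hz <= 0:
        return None
    # Distance from the tonic in semitones, folded into a single octave.
    pitch_class = (12 * math.log2(f0_hz / tonic_hz)) % 12

    def circular_distance(offset):
        d = abs(pitch_class - offset)
        return min(d, 12 - d)

    return SWARA_OFFSETS[min(SWARA_OFFSETS, key=circular_distance)]


def swara_segments(f0_track, tonic_hz, hop_seconds):
    """Merge consecutive frames with the same swar into (name, start_s, end_s) tuples."""
    runs = []  # each run is [name, first_frame_index, last_frame_index_exclusive]
    for i, f0 in enumerate(f0_track):
        name = frame_to_swara(f0, tonic_hz)
        if name is None:
            continue  # unvoiced frame: leaves a gap, so the next frame starts a new run
        if runs and runs[-1][0] == name and runs[-1][2] == i:
            runs[-1][2] = i + 1  # extend the current run by one frame
        else:
            runs.append([name, i, i + 1])
    return [(name, a * hop_seconds, b * hop_seconds) for name, a, b in runs]


if __name__ == "__main__":
    # Toy pitch track with a tonic of 240 Hz and one frame every 0.5 s.
    track = [240.0, 241.0, None, 269.4, 269.4, 302.4]
    for name, start, end in swara_segments(track, tonic_hz=240.0, hop_seconds=0.5):
        print(f"{name}: {start:.1f}s to {end:.1f}s")
```

What I do not know is how to get from raw, unlabeled audio to something like this reliably, and whether a learned model should replace or complement such a rule-based mapping.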
I am looking for a full guide to this project, as the topic is completely new to me. Anyone interested in working with me or guiding me is also welcome.