I have an audio file and the text that goes with it. I want to map the text to the audio, or in short, highlight the text in sync with the audio stream.
I don't want to use text-to-speech, because my audio has some background music. (Android)
TTS is your best bet. Run a search for "dictation software" - there are a ton of options out there, and some are more efficient or more affordable than others.
But as I mentioned, the audio file has music along with the words, so I'm not sure whether it will work in my scenario. I'll certainly give it a try, though :)
According to my volume, Auditory and Visual Sensations (Springer, NY, 2009),
I would suggest that the minimum effective duration of the ACF (autocorrelation function) is about 2 ms for speech and about 20 ms for music. For ACF analysis, the signal duration 2T is chosen to be about 40-60 ms for speech, but for music 2T should be 0.4-0.6 s.
Thus, you can set 2T ~ 40 ms for extracting ACF factors of speech signals.
Then you can try to reproduce the speech signals using the ACF factors.
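To make the windowing suggestion concrete, here is a minimal sketch in plain Python that cuts a signal into 2T = 40 ms windows, computes a normalized ACF for each one, and reports the delay and height of the strongest peak. The function names, the non-overlapping windows, and the 2 ms to 12.5 ms lag search range (roughly an 80-500 Hz pitch range) are my own assumptions for illustration and are not taken from the book. Treat it as a starting point only; it does not by itself align text with audio.

```python
import math

def frame_acf(frame, max_lag):
    """Normalized autocorrelation of one frame for lags 0..max_lag."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return [0.0] * (max_lag + 1)  # silent frame: no correlation structure
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]

def acf_peaks(signal, sample_rate, window_s=0.040,
              min_lag_s=0.002, max_lag_s=0.0125):
    """Return (peak delay in seconds, peak height) for each 2T window.

    window_s is the 2T value suggested for speech; the lag search range
    is an assumption made for this sketch.
    """
    n = int(round(window_s * sample_rate))
    lo = int(round(min_lag_s * sample_rate))
    hi = int(round(max_lag_s * sample_rate))
    results = []
    # Non-overlapping windows; a trailing partial window is ignored.
    for start in range(0, len(signal) - n + 1, n):
        acf = frame_acf(signal[start:start + n], hi)
        best = max(range(lo, hi + 1), key=lambda lag: acf[lag])
        results.append((best / sample_rate, acf[best]))
    return results

# Synthetic check: a 200 Hz tone sampled at 8 kHz has a 5 ms period.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * t / sample_rate)
        for t in range(sample_rate // 10)]
peaks = acf_peaks(tone, sample_rate)
```

For music, the same function could be called with window_s between 0.4 and 0.6, following the 2T range given above; the lag range would likely need adjusting as well.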
I was referring to "TTS" as a blanket term for all text-to-speech software. Unfortunately, I'm not particularly familiar with all of the options, but there are several very advanced pieces of software out there that pick up much more detail than the dictation programs included with mobile devices. So I suggest you search this product market more extensively - I feel like it's the right track for you, if what you need exists. Another option would be to hire a talented software developer ;)