Furthermore, I would like to know if there are any papers on the hours and samples needed to obtain valid and reliable data using Automatic Speech Recognition.
If we are speaking of building models, Julius is an option. For a commercial app, probably Dragon NaturallySpeaking by Nuance. You can also try this online tool, but it is not very accurate: https://talktyper.com/es/
The speech sample required for an analysis depends on the quality of the recordings and on the kind of analysis you want to perform. For example, if you want to analyse pitch, a sample of 10 seconds is enough to analyse the mean (Arantes & Eriksson, 2014).
Reference
Arantes, P., & Eriksson, A. (2014). Temporal stability of long-term measures of fundamental frequency. The Journal of the Acoustical Society of America, 135(4), 2428.
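To make the mean-pitch idea concrete, here is a minimal sketch (not the method of the cited paper) that estimates mean F0 over a 10-second signal with a simple autocorrelation pitch tracker; the frame length, hop size, and F0 search range are assumptions, and a real tool such as Praat would also discard unvoiced frames:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=500.0):
    # Autocorrelation-based F0 estimate for a single voiced frame.
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)               # shortest lag (highest F0) to consider
    hi = int(sr / fmin)               # longest lag (lowest F0) to consider
    lag = lo + np.argmax(corr[lo:hi]) # lag of the strongest periodicity
    return sr / lag

def mean_f0(signal, sr, frame_len=2048, hop=2048):
    # Mean F0 over all frames; assumes the whole signal is voiced.
    f0s = [estimate_f0(signal[i:i + frame_len], sr)
           for i in range(0, len(signal) - frame_len, hop)]
    return float(np.mean(f0s))

# Example: a synthetic 10-second 220 Hz tone, matching the 10 s
# sample size mentioned above; the estimate comes out close to 220 Hz.
sr = 16000
t = np.arange(10 * sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
print(mean_f0(tone, sr))
```

On real speech you would first segment voiced regions and band-pass the signal; this sketch only illustrates why a short sample can already stabilise a long-term mean measure.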
It is difficult to assess which ASR system is better. It depends on several factors, such as acoustic conditions, audio formats, type of speech, domain, and language or dialect.
You can find the answer about the amount of training data needed for a minimal system in our paper on developing a speaker-independent ASR system for subtitling:
"Automatic Live Subtitling: state of the art, expectations and current trends", NAB Broadcast Engineering Conference, April 5-10 2014, Las Vegas, USA