A segmentation process gives phonetic segments of variable length, and then these segments are each labeled as one out of many phonemes (classification). I would just like to know how this (segmentation + classification) task differs from the recognition process.
"Acoustic decoding" ( aka "acoustic recognition") is precisely this segmentation + classification process. But ASR (automatic speech recognition) at present involves much more than this. Which amounts, more or less, to seeing whether the labelled segments you obtained make sense within a particular language (or rather a sample of a particular language).
I completely differ from Klaus Schuricht. This has been the traditional view for almost 30 years, and speech recognition has now come to a dead end. Evidence? Nuance sells separate ASR systems for American English, British English and Indian English. If people do not attempt a radically different approach, Nuance will need to come out with versions for children, old people and so on, even just for English, and this will never end.
The only possible solution is language-independent, vocabulary-independent segmentation of phones, followed by the application of the phonological constraints of the particular language or languages, the vocabulary, semantic analysis, etc.
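A minimal sketch of this two-stage idea, under my own assumptions (the single phonotactic rule below is just one illustrative constraint, not a complete model): a language-independent phone segmenter proposes candidate phone strings, and a language-specific check then prunes candidates the language does not permit.

```python
# Stage 2 of the proposed pipeline: apply language-specific phonotactic constraints
# to language-independent phone-segmentation candidates.

def allowed_in_english(phones):
    """Reject phone strings violating a simple English constraint: no word-initial /ng/."""
    return not (phones and phones[0] == "ng")

# Hypothetical candidate phone strings from a language-independent segmenter.
candidates = [["ng", "ae", "t"], ["k", "ae", "t"]]

survivors = [p for p in candidates if allowed_in_english(p)]
print(survivors)  # [['k', 'ae', 't']] -- only the phonotactically legal candidate remains
```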
The current ASR systems start with a huge vocabulary and match the sequence of feature vectors derived from the input speech against sequences of words drawn from this stored vocabulary, maximizing the posterior probability and using n-gram statistics of the words, etc. But they are generally BLIND to the sentence structure, syntax and semantics. Thus, this technology is anything but scalable.
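For reference, this is the standard noisy-channel formulation being described (the notation here is mine): the decoder picks the word sequence $\hat{W}$ that maximizes the posterior probability given the observed feature-vector sequence $O$, with the language model reduced to n-gram statistics,

$$\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} P(O \mid W)\,P(W), \qquad
P(W) \;\approx\; \prod_{i} P\!\left(w_i \mid w_{i-n+1},\ldots,w_{i-1}\right).$$

Note that nothing in this objective looks beyond the n-gram window, which is exactly the blindness to syntax and semantics mentioned above.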
It is left to you and me to do a great job and provide all these additional capabilities to make it a REAL recognition system. Unlike human beings, current ASR systems do not "recognize" the speech; they simply produce a best-approximation transcription.
Further, most Indian speech is NOT monolingual: we mix two or more languages in a single utterance. To handle such multilingual utterances, the only way out is to do phonetic segmentation first and then apply higher-level knowledge.
In ASR, the acoustic signal is segmented into short frames (commonly 20-25 ms long, advanced in steps of about 10 ms so that neighboring frames overlap), and features such as MFCCs are extracted from each frame.
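A short sketch of that framing + MFCC front end, assuming the librosa library is available; the file name is hypothetical, and the 25 ms / 10 ms window and hop sizes are typical values rather than fixed standards.

```python
import librosa

# Load an utterance at 16 kHz, a common ASR sampling rate ("utterance.wav" is a placeholder).
signal, sr = librosa.load("utterance.wav", sr=16000)

# 25 ms analysis windows shifted by 10 ms, so neighboring frames overlap.
frame_length = int(0.025 * sr)   # 400 samples
hop_length = int(0.010 * sr)     # 160 samples

# 13 MFCCs per frame, the classic front-end feature vector.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=frame_length, hop_length=hop_length)

print(mfcc.shape)  # (13, number_of_frames)
```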