If the first twenty speech vectors correspond to one particular phoneme, how are those twenty vectors shared among the three states of that phoneme's model?
Training of the HMM state parameters is done via the Forward-Backward algorithm. It "fuzzy-aligns" the training vectors to the HMM states according to the transition, self-loop, and state-output probabilities. This means that at the end of the FB algorithm you obtain a matrix/table of occupancy probabilities of size TxN, where T is the number of training vectors and N is the number of HMM states.
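A minimal sketch of how that TxN table is computed (toy numbers only: the transition matrix `A`, the initial distribution `pi`, and the per-frame state-output probabilities `B` are made-up stand-ins, not a real acoustic model):

```python
import numpy as np

def forward_backward(A, B, pi):
    """Return the T x N state-occupancy matrix gamma[t, j] = P(state j at frame t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))  # forward probabilities
    beta = np.zeros((T, N))   # backward probabilities

    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]

    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (beta[t + 1] * B[t + 1])

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # normalize each frame to sum to 1
    return gamma

# Toy 3-state left-to-right phoneme model, 20 training vectors.
rng = np.random.default_rng(0)
A = np.array([[0.6, 0.4, 0.0],   # state 0: self-loop or advance to 1
              [0.0, 0.6, 0.4],   # state 1: self-loop or advance to 2
              [0.0, 0.0, 1.0]])  # state 2: absorbing
pi = np.array([1.0, 0.0, 0.0])
B = rng.random((20, 3))          # stand-in for Gaussian output probabilities

gamma = forward_backward(A, B, pi)
print(gamma.shape)  # (20, 3): each of the 20 vectors is softly shared among the 3 states
```

Each row of `gamma` sums to 1, so every training vector is fractionally distributed over the three states; those fractions are the weights used when re-estimating each state's parameters.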
The same can be done via the "non-fuzzy" Viterbi algorithm. It produces a single best path/alignment, i.e. a hard assignment of which vectors belong to which state. But for training the FB algorithm is better; Viterbi is used for recognition.
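For contrast, a sketch of the Viterbi hard alignment on the same kind of toy model (again, `A`, `pi`, and `B` are illustrative stand-ins; the small constant inside the logs just guards against log(0)):

```python
import numpy as np

def viterbi(A, B, pi):
    """Return the single best state sequence (one state index per frame)."""
    T, N = B.shape
    eps = 1e-300
    logA = np.log(A + eps)
    delta = np.zeros((T, N))          # best log-score ending in each state
    psi = np.zeros((T, N), dtype=int) # backpointers

    delta[0] = np.log(pi + eps) + np.log(B[0] + eps)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[t] + eps)

    # Backtrack from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Same toy 3-state left-to-right model as before.
rng = np.random.default_rng(0)
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
B = rng.random((20, 3))

path = viterbi(A, B, pi)
print(path)  # e.g. something like [0 0 ... 1 1 ... 2 2]: each vector belongs to exactly one state
```

Where Forward-Backward gives each vector a fractional share across states, Viterbi commits each vector to exactly one state, which is why the resulting alignment is monotone for a left-to-right model.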