If I am not mistaken, you are talking about the use of Mel-frequency cepstral coefficients (MFCCs) for automatic speech recognition (ASR)? If so, there are actually two questions hidden in your question: 1) why use filters based on a Mel scale? and 2) why switch from the spectrum to the cepstrum?
The main reason for point 1 is that human perception of pitch does not follow a linear frequency scale. ASR systems often attempt to mimic this behaviour by using wider filters to extract speech features at higher frequencies, with a Mel-scale-related function determining how the filter bandwidths grow with frequency.
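To illustrate, here is a small sketch (using the common 2595·log10(1 + f/700) variant of the Mel formula; the exact constants vary between implementations). Placing filter centres at equal spacing on the Mel scale and converting back to Hz shows that the resulting filter widths grow with frequency:

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy-style mel formula (one common variant)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Ten triangular filters spanning 0-8000 Hz, with edges equally
# spaced on the mel scale (each filter spans two adjacent intervals)
n_filters = 10
lo, hi = hz_to_mel(0.0), hz_to_mel(8000.0)
mel_points = [lo + i * (hi - lo) / (n_filters + 1) for i in range(n_filters + 2)]
hz_points = [mel_to_hz(m) for m in mel_points]

# Width of each filter in Hz: grows monotonically with frequency
widths = [hz_points[i + 2] - hz_points[i] for i in range(n_filters)]
print([round(w) for w in widths])
```

Running this prints a strictly increasing list of bandwidths, which is exactly the "larger filters at higher frequencies" behaviour described above.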
The main reason for point 2 is that the spectrum carries information about both the source (i.e. the vocal folds) and the resonators (the speech articulators): the two signals are convolved. Taking the logarithm turns this convolution into a sum, and switching to a cepstral representation separates the two components, so the source-related information can be discarded. This yields more precise and more speaker-independent features.
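A toy demonstration of this separation (a sketch, not a full MFCC pipeline): build a synthetic log-spectrum as the sum of a slowly varying vocal-tract envelope and a rapidly varying source ripple, then take a DCT (as MFCC pipelines do). The envelope energy lands in the low-quefrency coefficients, while the source ripple lands much higher, so keeping only the first few coefficients discards the source:

```python
import math

def dct_ii(x):
    # Type-II DCT: maps a log-spectrum to cepstral coefficients
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
            for k in range(n)]

n = 64
# Toy log-spectrum: smooth resonator envelope (slow) + source ripple (fast)
envelope = [1.0 + 0.8 * math.cos(2 * math.pi * 1.5 * j / n) for j in range(n)]
ripple   = [0.3 * math.cos(2 * math.pi * 20.0 * j / n) for j in range(n)]
log_spec = [e + r for e, r in zip(envelope, ripple)]

c = dct_ii(log_spec)
# Keeping e.g. the first 13 coefficients (a typical MFCC count) retains the
# envelope and drops the source ripple, which lives at high quefrency
low_energy  = sum(abs(v) for v in c[:13])
high_energy = sum(abs(v) for v in c[13:])
print(low_energy > high_energy)  # True: articulator info sits at low quefrency
```

Truncating the cepstral series like this (sometimes called liftering) is exactly how the speaker-dependent source information gets thrown away in practice.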