Some intuition about cepstral features helps clarify what to look for when using them in a speech-based system.
- Cepstral features are computed by taking an inverse Fourier transform (in practice, usually a discrete cosine transform) of the warped logarithmic spectrum, so they capture how quickly the log spectrum changes across the spectral bands. They are favored for their ability to separate the contributions of source and filter in a speech signal. In other words, in the cepstral domain the influence of the vocal cords (source) and the vocal tract (filter) can be separated, because the slowly varying formant envelope of the vocal tract and the rapidly varying harmonic structure of the excitation fall in different regions: the envelope at low quefrencies and the excitation at higher quefrencies.
- If a cepstral coefficient has a positive value, it tends to indicate a sonorant sound, since most of the spectral energy in sonorant sounds is concentrated in the low-frequency regions. This reading applies most directly to the first-order coefficient.
- On the other hand, if a cepstral coefficient has a negative value, it tends to indicate a fricative sound, since most of the spectral energy in fricative sounds is concentrated at high frequencies.
- The lower order coefficients contain most of the information about the overall spectral shape of the source-filter transfer function.
- The zero-order coefficient indicates the average power of the input signal.
- The first-order coefficient represents the distribution of spectral energy between low and high frequencies.
- Higher-order coefficients represent increasing levels of spectral detail, but depending on the sampling rate and estimation method, 12 to 20 cepstral coefficients are typically enough for speech analysis. Selecting a large number of cepstral coefficients makes the models more complex. For example, if we model a speech signal with a Gaussian mixture model (GMM) and use a large number of cepstral coefficients, we typically need more data to estimate the GMM parameters accurately.
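
The points above about the zero-order and first-order coefficients can be checked on toy data. The following is a minimal sketch, assuming the cepstrum is taken as an unnormalized DCT-II of log band energies; the helper `cepstrum_from_log_bands`, the 26 bands, and the two synthetic spectra are illustrative assumptions, not part of any particular front end.

```python
import numpy as np

def cepstrum_from_log_bands(log_energies, n_coeffs=13):
    """Unnormalized DCT-II of log band energies (hypothetical helper)."""
    log_energies = np.asarray(log_energies, dtype=float)
    n_bands = log_energies.shape[0]
    k = np.arange(n_bands)               # band index
    q = np.arange(n_coeffs)[:, None]     # cepstral (quefrency) index
    basis = np.cos(np.pi * q * (k + 0.5) / n_bands)
    return basis @ log_energies

# Synthetic log spectra (assumed shapes, for illustration only).
n_bands = 26
band = np.arange(n_bands)
sonorant_like = np.log(np.exp(-band / 5.0) + 1e-3)  # energy concentrated in low bands
fricative_like = sonorant_like[::-1]                # same energy, concentrated in high bands

c_son = cepstrum_from_log_bands(sonorant_like)
c_fric = cepstrum_from_log_bands(fricative_like)

# c[0] is the sum of the log band energies, so it tracks overall level.
print("c0:", c_son[0], c_fric[0])
# c[1] weights low bands positively and high bands negatively.
print("c1:", c_son[1], c_fric[1])
```

With this convention the first-order coefficient comes out positive for the low-frequency-heavy spectrum and negative for its mirror image, while the zero-order coefficient is the same for both. Real front ends often apply a different DCT scaling, which changes magnitudes but not these signs.
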
The right number of features depends on the system, and several considerations go into the choice. One main trend is to reduce the number of features so that the model stays feasible for real-time implementation; this works because the lower-order coefficients carry most of the cues about the overall spectral shape.
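
To make the cost of extra coefficients concrete, here is a rough sketch that counts the free parameters of a GMM as the feature dimension grows. The function name `gmm_param_count`, the 64 components, and the dimensions tried are assumptions chosen only for illustration.

```python
def gmm_param_count(n_features, n_components, covariance="diag"):
    """Free parameters of a GMM (hypothetical helper for illustration)."""
    means = n_components * n_features
    if covariance == "diag":
        covs = n_components * n_features
    elif covariance == "full":
        # Each symmetric covariance matrix has D * (D + 1) / 2 free entries.
        covs = n_components * n_features * (n_features + 1) // 2
    else:
        raise ValueError("covariance must be 'diag' or 'full'")
    weights = n_components - 1  # mixture weights sum to one
    return means + covs + weights

# Assumed setup: 64 mixture components, a few feature dimensions.
for d in (13, 20, 39):
    print(d, gmm_param_count(d, 64, "diag"), gmm_param_count(d, 64, "full"))
```

The count grows linearly with the number of coefficients for diagonal covariances and quadratically for full covariances, which is one reason a larger feature vector generally needs more training data.
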