What types of features can help to recognize speakers or speech more accurately? Why do we need additional features for speaker recognition? What are the recent trends in this field? Is any literature available on recent trends in speech recognition?
The DNN-free approach: MFCC/LFCC features (+ Delta + DeltaDelta) [1] are usually used. More robust alternatives such as MHEC [2] and PNCC [3] have also been developed in recent years. Another alternative is NMF (non-negative matrix factorisation) applied directly to the spectrogram [4].
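As a rough illustration of the "+ Delta + DeltaDelta" part, here is a minimal sketch in plain Python. It assumes the static features (e.g. MFCCs) have already been computed and are given as a list of frames. The function names `delta` and `add_deltas`, the window half-width `N=2`, and the edge handling (repeating the first/last frame) are my own assumptions; toolkits differ in these details.

```python
def delta(features, N=2):
    """Regression-based delta over time.

    features: list of frames, each frame a list of floats.
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    Frames beyond the edges are replaced by the first/last frame (assumption).
    """
    T = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        frame = []
        for d in range(len(features[t])):
            num = 0.0
            for n in range(1, N + 1):
                nxt = features[min(t + n, T - 1)][d]
                prv = features[max(t - n, 0)][d]
                num += n * (nxt - prv)
            frame.append(num / denom)
        out.append(frame)
    return out


def add_deltas(features, N=2):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = delta(features, N)
    d2 = delta(d1, N)  # delta of the delta
    return [f + a + b for f, a, b in zip(features, d1, d2)]


# Toy "static features": dimension 0 is a linear ramp, dimension 1 is constant.
static = [[float(t), 2.0] for t in range(6)]
d = delta(static)
stacked = add_deltas(static)
```

With 13 static MFCCs per frame, the same stacking would give the usual 39-dimensional vectors.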
The DNN-based approach: a deep neural network can be trained to either:
learn a new unsupervised representation of the acoustic features: an auto-encoder, denoising auto-encoder or VAE [5] is trained using stacked mel-filterbank or MFCC features as input/output, and the activations of one of the hidden layers are then used as the new representation (also called bottleneck features when the dimension of the new representation is very low compared to the original); or
learn a more sophisticated (and hopefully discriminative) representation: a DNN is trained as a frame-level phonetic classifier (with context frames) [6], and the activations of one of the hidden layers are used as the new representation. A detailed review can be found in [7].
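To make the two mechanical steps above concrete (stacking context frames, then reading out a hidden layer), here is a small sketch in plain Python. It is not the setup of [5] or [6]: the weights are random placeholders standing in for a trained network, and the names `splice`, `dense`, `random_layer` and `bottleneck_features`, as well as the layer sizes, are hypothetical.

```python
import math
import random


def splice(frames, context=2):
    """Stack each frame with `context` frames on either side (edges repeated)."""
    T = len(frames)
    spliced = []
    for t in range(T):
        window = []
        for k in range(-context, context + 1):
            idx = min(max(t + k, 0), T - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


def dense(x, weights, biases):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    """Placeholder weights; a real system would use trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def bottleneck_features(spliced, layers, bottleneck_index):
    """Run the forward pass up to (and including) the chosen hidden layer."""
    feats = []
    for x in spliced:
        h = x
        for i, (w, b) in enumerate(layers):
            h = dense(h, w, b)
            if i == bottleneck_index:
                break
        feats.append(h)
    return feats


rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(10)]  # 10 frames, 4 dims
spliced = splice(frames, context=2)                                # 5 * 4 = 20 dims
# Auto-encoder-shaped stack 20 -> 16 -> 3 -> 16 -> 20; layer 1 is the bottleneck.
layers = [random_layer(20, 16, rng), random_layer(16, 3, rng),
          random_layer(3, 16, rng), random_layer(16, 20, rng)]
bn = bottleneck_features(spliced, layers, bottleneck_index=1)
```

Each 20-dimensional spliced input is reduced to a 3-dimensional vector, which would then replace or complement the original features in the back-end.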
This list doesn't exclude acoustic–linguistic features like formants, rhythmic features, and high-level features like prosody. Note that the choice of features can be application-dependent; different features may be chosen for a general-purpose automatic speaker recognition system than for a forensic speaker recognition system.
---
References:
[1] Hansen, John HL, and Taufiq Hasan. "Speaker recognition by machines and humans: A tutorial review." IEEE Signal processing magazine 32.6 (2015): 74-99.
[2] Sadjadi, Seyed Omid, Taufiq Hasan, and John HL Hansen. "Mean Hilbert envelope coefficients (MHEC) for robust speaker recognition." Thirteenth Annual Conference of the International Speech Communication Association. 2012.
[3] McLaren, Mitchell, et al. "Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
[4] Joder, Cyril, and Björn Schuller. "Exploring nonnegative matrix factorization for audio classification: Application to speaker recognition." Speech Communication; 10. ITG Symposium; Proceedings of. VDE, 2012.
[5] Zhang, Zhaofeng, et al. "Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification." EURASIP Journal on Audio, Speech, and Music Processing 2015.1 (2015): 12.
[6] Lei, Yun, et al. "A novel scheme for speaker recognition using a phonetically-aware deep neural network." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
[7] Matějka, Pavel, et al. "Analysis of DNN approaches to speaker identification." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
Various features for speech and speaker recognition are described in the literature. Sometimes concatenating several feature sets proves beneficial for the recognition task. Nowadays, however, deep learning is the dominant trend, so deep features can also be used for your purpose.
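A minimal sketch of frame-level concatenation in plain Python, assuming both feature sets are already aligned to the same frames. The per-dimension mean/variance normalisation before stacking is my own addition, so that one feature set does not dominate purely because of its scale; the names `cmvn` and `concatenate` are hypothetical.

```python
import math


def cmvn(features):
    """Per-dimension mean and variance normalisation over all frames."""
    T, D = len(features), len(features[0])
    out = [[0.0] * D for _ in range(T)]
    for d in range(D):
        col = [f[d] for f in features]
        mean = sum(col) / T
        var = sum((v - mean) ** 2 for v in col) / T
        std = math.sqrt(var) or 1.0  # avoid dividing by zero for constant dims
        for t in range(T):
            out[t][d] = (features[t][d] - mean) / std
    return out


def concatenate(*feature_sets):
    """Normalise each feature set, then join them frame by frame."""
    if len({len(fs) for fs in feature_sets}) != 1:
        raise ValueError("all feature sets must have the same number of frames")
    normalised = [cmvn(fs) for fs in feature_sets]
    return [sum(frames, []) for frames in zip(*normalised)]


# Two toy feature sets on very different scales, 3 frames each.
set_a = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
set_b = [[100.0], [200.0], [300.0]]
fused = concatenate(set_a, set_b)  # 3 frames, 2 + 1 = 3 dims per frame
```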