What types of features can be used for speaker/speech recognition? Why do we need additional features for speaker recognition?

For speaker recognition:

The DNN-free approach: MFCC/LFCC features with their first- and second-order time derivatives (delta and delta-delta) [1] are usually used. More robust alternatives such as MHEC [2] and PNCC [3] have also been developed in recent years. Another alternative is NMF (non-negative matrix factorisation) applied directly to the spectrogram [4].
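As a rough illustration of the "+ Delta + DeltaDelta" part, here is a minimal NumPy sketch that appends delta and delta-delta coefficients to a matrix of static features. It assumes the commonly used regression formula with a window of N = 2 frames and edge padding; the function names `delta` and `add_deltas` are made up for this example, and the input is random data standing in for real MFCCs.

```python
import numpy as np

def delta(features, N=2):
    """Regression-based time derivative of a (frames, coeffs) feature matrix."""
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    # Repeat the first and last frames so every frame has N neighbours on each side.
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for n in range(1, N + 1):
        ahead = padded[N + n : N + n + num_frames]    # frame t + n
        behind = padded[N - n : N - n + num_frames]   # frame t - n
        out += n * (ahead - behind)
    return out / denom

def add_deltas(static, N=2):
    """Return [static, delta, delta-delta] stacked along the coefficient axis."""
    static = np.asarray(static, dtype=float)
    d1 = delta(static, N)
    d2 = delta(d1, N)
    return np.hstack([static, d1, d2])

# Stand-in for 100 frames of 13 static MFCCs (random values, not real speech).
mfcc = np.random.default_rng(0).standard_normal((100, 13))
feats = add_deltas(mfcc)
print(feats.shape)  # (100, 39)
```

With 13 static coefficients this gives the familiar 39-dimensional vector per frame. The window size and padding mode are assumptions and differ between toolkits.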

The DNN-based approach: a deep neural network can be trained to either:

learn a new unsupervised representation of the acoustic features: an auto-encoder, denoising auto-encoder or VAE [5] is trained with stacked mel-filterbank or MFCC features as input and output, and the activations of one of the hidden layers are then used as the new representation (also called bottleneck features when its dimension is much lower than that of the original features), or

learn a more sophisticated (and hopefully discriminative) representation: a DNN is trained as a frame-level phonetic classifier (with context frames) [6], and the activations of one of its hidden layers are used as the new representation. A detailed review can be found in [7].
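Both variants end with the same step: stack each frame with its context, run it through the network, and read out a hidden layer. A minimal NumPy sketch of that step follows. Everything specific in it is an assumption for illustration: the weights are random placeholders rather than a trained auto-encoder or phonetic classifier, and the layer sizes, the tanh activations, the ±5-frame context and the names `stack_context` and `bottleneck_features` are hypothetical.

```python
import numpy as np

def stack_context(frames, left=5, right=5):
    """Concatenate each frame with its neighbours: (T, D) -> (T, (left + 1 + right) * D)."""
    frames = np.asarray(frames, dtype=float)
    T = frames.shape[0]
    # Repeat edge frames so the first and last frames also get a full context window.
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i : i + T] for i in range(left + 1 + right)])

def bottleneck_features(stacked, weights, biases, bottleneck_index):
    """Forward pass up to and including hidden layer `bottleneck_index`; return its activations."""
    h = stacked
    for W, b in zip(weights[: bottleneck_index + 1], biases[: bottleneck_index + 1]):
        h = np.tanh(h @ W + b)
    return h

# Placeholder network: input -> 512 -> 64 (bottleneck) -> 512.
# In practice these weights would come from a trained model.
rng = np.random.default_rng(0)
layer_sizes = [11 * 40, 512, 64, 512]
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

# Stand-in for 200 frames of 40-dim log mel-filterbank features (random, not real speech).
fbank = rng.standard_normal((200, 40))
bnf = bottleneck_features(stack_context(fbank), weights, biases, bottleneck_index=1)
print(bnf.shape)  # (200, 64)
```

The resulting 64-dimensional vectors would then replace, or be appended to, the original acoustic features before speaker modelling.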

This list doesn't exclude acoustic-linguistic features such as formants and rhythmic features, or high-level features such as prosody. Note that the choice of features can be application-dependent: a general-purpose automatic speaker recognition system and a forensic speaker recognition system may call for different features.

---

References:

[1] Hansen, John HL, and Taufiq Hasan. "Speaker recognition by machines and humans: A tutorial review." IEEE Signal Processing Magazine 32.6 (2015): 74-99.

[2] Sadjadi, Seyed Omid, Taufiq Hasan, and John HL Hansen. "Mean Hilbert envelope coefficients (MHEC) for robust speaker recognition." Thirteenth Annual Conference of the International Speech Communication Association. 2012.

[3] McLaren, Mitchell, et al. "Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.

[4] Joder, Cyril, and Björn Schuller. "Exploring nonnegative matrix factorization for audio classification: Application to speaker recognition." Speech Communication; 10. ITG Symposium; Proceedings of. VDE, 2012.

[5] Zhang, Zhaofeng, et al. "Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification." EURASIP Journal on Audio, Speech, and Music Processing 2015.1 (2015): 12.

[6] Lei, Yun, et al. "A novel scheme for speaker recognition using a phonetically-aware deep neural network." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.

[7] Matějka, Pavel, et al. "Analysis of DNN approaches to speaker identification." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
