What is the optimal time for training/testing the UBM-GMM and i-vector based system?

Hi,

Responding to your last question,

It depends on what you're doing. Speaker recognition is a broad term; in practice, you're generally doing either verification (checking whether two recordings come from the same speaker) or identification (recognizing the identity of the speaker present in the test segment by comparing it to a set of known speaker models).
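To make the distinction concrete, here is a minimal sketch that assumes each recording has already been reduced to a fixed-length vector (e.g. an i-vector) and that plain cosine similarity is used as the score. The function names, the toy 3-dimensional vectors, and the threshold are all made up for illustration; a real system would use a trained back-end such as PLDA.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled, test, threshold):
    """Verification: do `enrolled` and `test` come from the same speaker?"""
    return cosine_score(enrolled, test) >= threshold

def identify(enrolled_models, test):
    """Identification: name of the enrolled speaker that best matches `test`."""
    return max(enrolled_models,
               key=lambda name: cosine_score(enrolled_models[name], test))

# Toy "i-vectors" (hypothetical values, only for illustration).
models = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
test_segment = np.array([0.9, 0.1, 0.0])

same_as_alice = verify(models["alice"], test_segment, threshold=0.5)
same_as_bob = verify(models["bob"], test_segment, threshold=0.5)
best_match = identify(models, test_segment)
```

Verification answers a yes/no question about one pair, while identification ranks the test segment against every enrolled model.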

In these two contexts, "enrollment data" generally means "known speakers" and "test data" means "suspected/unknown speakers". Enrollment data are a kind of "reference" to which you compare your test recordings (for either verification or identification). They "can" be used to train a scoring model, even though in general a different set, called the "training set", is used for that purpose.

For example, in the NIST SRE 2010 (Speaker Recognition Evaluation), the enrollment and test data (reference speakers and unknown/suspected speakers) are a subset of the NIST 2010 database, while the training data (used to train the UBM, the T matrix AND the PLDA model) belong to completely different datasets (e.g. NIST SRE 2004, 2005, 2006 / Switchboard II Phases 2 and 3 / Switchboard Cellular Parts 1 and 2 / Fisher English Parts 1 and 2 / ...). You can check the "Experiments and results" section of this paper, for example [1].

Now what happens if you add a new speaker? It depends.

1 - If you are adding a new speaker class to your training data (a set of i-vectors corresponding to one particular speaker), you'll have to retrain your scoring model to take this new class into account (retrain your PLDA, recompute your WCCN matrix, ...) and then use the new model to compare your enrollment/test segments. Generally, this does not happen, because you're supposed to use as much training data as possible from the start and stick with the same model for all your experiments. Otherwise, you'd have to redo all your experiments every time you add new data in order to have comparable results (scores coming from the same model).

2 - If you're adding new enrollment data (new reference speakers), then nothing changes. The scoring model is supposed to act as a black box that provides scores for any new test/enrollment utterances. Once it is trained, it is used the same way for any test/enrollment data.
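The two cases above can be sketched as follows, using WCCN followed by cosine scoring as the scoring model. This is only an illustration under stated assumptions: the i-vectors are synthetic (random speaker means plus session noise), the dimensions are tiny, and the function names `train_wccn` and `score` are invented here. The point is that the projection `B` is estimated once from the training set and then reused unchanged for a speaker who never appeared in training.

```python
import numpy as np

def train_wccn(ivectors, labels):
    """Estimate a WCCN projection B with B @ B.T = inv(W),
    where W is the average within-class covariance of the training set."""
    classes = np.unique(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        x = ivectors[labels == c]
        centered = x - x.mean(axis=0)
        W += centered.T @ centered / len(x)
    W /= len(classes)
    W += 1e-6 * np.eye(dim)  # small ridge for numerical stability (assumption)
    return np.linalg.cholesky(np.linalg.inv(W))

def score(B, enroll, test):
    """Cosine similarity between two i-vectors after WCCN projection."""
    a, b = B.T @ enroll, B.T @ test
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Training set: 20 synthetic speakers, 10 sessions each, 5-dimensional vectors.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(20), 10)
speaker_means = rng.standard_normal((20, 5))
train_ivectors = speaker_means[labels] + 0.1 * rng.standard_normal((200, 5))

# Case 1: the scoring model is trained once, on the training speakers only.
B = train_wccn(train_ivectors, labels)

# Case 2: a new reference speaker is enrolled; B is reused as-is, no retraining.
new_enroll = np.array([2.0, 0.1, 0.0, 0.0, 0.0])
test_same = np.array([1.9, -0.1, 0.1, 0.0, 0.0])   # same (new) speaker
test_other = np.array([0.0, 2.0, 0.1, 0.0, 0.0])   # a different speaker

score_same = score(B, new_enroll, test_same)
score_other = score(B, new_enroll, test_other)
```

Only if the new speaker's data were added to `train_ivectors` would `train_wccn` need to be run again, and all previously computed scores would then have to be recomputed to stay comparable.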

The important thing to understand is that the expression "speaker class" can have different interpretations depending on whether it refers to enrollment or training data. In the former case, it refers to a "reference speaker" that test segments will be compared to in the scoring phase (you can compare a test segment to one or many enrollment sessions [2]), while in the latter it refers to the speaker classes used to train the scoring model. It's like training a PCA or a regression model: the dataset used to train the model is generally independent from the one you're testing on, but if needed, you can transform your training data using the same model.

---

References:

[1] Bousquet, Pierre-Michel, et al. "Variance-spectra based normalization for i-vector standard and probabilistic linear discriminant analysis." Odyssey. 2012.

[2] Liu, Gang, et al. "An investigation on back-end for speaker recognition in multi-session enrollment." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.

Rizwan Ishaq

Training the total variability matrix really takes a lot of time, but do LDA, WCCN and PLDA need to be applied to the training data?

Ayoub Bouziane

Hello,

PLDA is trained on the training data?! Does that mean we have to re-estimate its parameters each time a new user is added to our system?!

@Ayoub: A question: if they are applied to the enrollment data, then we only have as many i-vector classes as there are enrolled speakers, so what do we do for a new user?

@Hamid, still waiting for your response to the comment.
