I want to try out speaker recognition with the i-vector approach. Can someone walk me through the steps of i-vector generation?
There are open source libraries that can help you build a speaker recognition system from scratch. Depending on which language/environment you're comfortable with, you can use one of these libraries/platforms:
- ALIZE/LIA_RAL : an open source platform for biometric authentication, written in C++ and maintained by the LIA lab at the University of Avignon (France). LIA_SpkDet is what you are looking for: it provides tools to build a world model (UBM), train an i-vector extractor (the T matrix), extract i-vectors, and perform i-vector normalization and scoring (using PLDA, WCCN, Mahalanobis, ...). Download link: http://mistral.univ-avignon.fr/download_en.html
A tutorial on building an i-vector-based system (Tutorial to use LIA_SpkDet - i-vector based system) is provided here: http://mistral.univ-avignon.fr/doc_en.html
- SIDEKIT : an open source Python package for speaker/language recognition and speaker diarization, maintained by the LIUM lab at the University of Le Mans (France); it can also be installed from PyPI (see the note after this list). The download page and documentation can be found at: http://lium.univ-lemans.fr/sidekit/
- MSR Identity Toolbox : a MATLAB toolbox released by Microsoft Research that provides tools for UBM training, i-vector extraction and scoring. Link: http://research.microsoft.com/apps/pubs/default.aspx?id=205119
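If you go with SIDEKIT, it is also published on PyPI (the package name is sidekit, if I remember correctly), so the quickest way to try it out is:
----------------
pip install sidekit    # installs SIDEKIT and its Python dependencies (assumes Python/pip are already set up)
----------------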
----------------------
First of all, it's important to understand the components of a speaker recognition system and the models and data involved. A good starting point would be this tutorial [1] and this review [2].
The three major building blocks of a speaker recognition system are:
- Feature extraction : generally, either MFCC or PLP features are used in speaker recognition systems. The goal of this step is to compute a more compact and effective representation of the speech samples using frequency analysis techniques (e.g. the FFT).
- I-vector extraction : the term i-vector stands for "identity vector". This step computes an utterance model (a single vector) from the corresponding MFCC features. It relies on a technique called factor analysis: 0th- and 1st-order statistics of the features are accumulated against a generic model (a mixture of Gaussians) called the world model or universal background model (UBM), and the corresponding i-vector is then computed from those statistics (see the sketch after this list). Roughly, the UBM describes what the average distribution of "speech" looks like, and i-vectors give a representation of a speech utterance that is "relative" to that generic model.
- I-vector scoring : this step assesses the "closeness" of two i-vectors based on a large set of examples. The most widely used scoring model is PLDA (probabilistic linear discriminant analysis); it uses between- and within-speaker distributions to compute a "distance" (actually a likelihood ratio) between two i-vectors.
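For context, the underlying model (the standard total variability formulation from the i-vector literature, not anything specific to a particular toolkit) writes the GMM mean supervector of an utterance as:
----------------
M = m + Tw
----------------
where m is the UBM mean supervector, T is the low-rank total variability matrix estimated on the training data, and w is the i-vector, i.e. the posterior mean of a standard-normal latent variable given the utterance's 0th- and 1st-order statistics.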
---------------------------
I will explain how to use ALIZE/LIA_RAL since it's the toolkit I'm most familiar with:
1 - Download the sources of the latest versions (under "Released versions"):
http://mistral.univ-avignon.fr/download.html
2 - Uncompress ALIZE_3.0.zip and LIA_RAL_3.0.zip (the documentation page is here: http://mistral.univ-avignon.fr/doc.html)
3 - Now compile ALIZE first, then LIA_RAL (LIA_RAL uses the ALIZE library). You'll need the GNU autotools (aclocal, automake and autoconf) to compile ALIZE, so if they're not already installed on your system, you can run:
---------------------------------
sudo apt-get install autotools-dev
sudo apt-get install automake
----------------------------------
If you're using macOS, you'll probably need to use Homebrew or MacPorts instead.
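For example, with Homebrew (these are the standard formula names; double-check if your setup differs):
----------------
brew install autoconf automake   # provides aclocal, automake and autoconf
----------------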
Now that everything is ready, you can compile ALIZE (these steps are also listed in the README file):
---------------
cd ALIZE_3.0/
aclocal
automake
autoconf
./configure
make
----------------
Then, you can compile LIA_RAL by following the same steps, passing two extra options to ./configure (see the example after this list):
--with-alize : the absolute path to ALIZE_3.0/
--enable-MT : enables multithreading in the LIA_RAL binaries
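Concretely, the sequence looks like this (the ALIZE path below is a placeholder; replace it with the absolute path on your machine):
----------------
cd LIA_RAL_3.0/
aclocal
automake
autoconf
# point --with-alize at your own ALIZE_3.0 directory (placeholder path below)
./configure --with-alize=/absolute/path/to/ALIZE_3.0 --enable-MT
make
----------------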
4 - If nothing went wrong in the previous step, you should have under LIA_RAL_3.0/bin/ all the binaries you'll need to build a speaker recognition system.
5 - The example code provided in the tutorial is based on NIST data (ndx/NIST_files.ndx); you'll have to copy all the needed .sph files under data/sph/ in order to try it.
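If you want to run it on your own corpus instead, something along these lines works (the source path is obviously a placeholder):
----------------
mkdir -p data/sph
# copy your own SPHERE-format audio files into the directory the scripts expect
cp /path/to/your/corpus/*.sph data/sph/
----------------
You'll then also have to adapt the .ndx lists so they reference your files instead of the NIST ones.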
### Step 1 : Feature extraction using SPRO or HTK ####
The script 01_RUN_feature_extraction.sh allows you to extract MFCC features using either SPRO or HTK (all the needed binaries can be found under ./bin/).
The output is a set of .tmp.prm files (one for each .sph file) generated under ./data/prm/.
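A quick sanity check after running it (the commands below just count the output files; adjust the paths if your layout differs):
----------------
./01_RUN_feature_extraction.sh
# there should be one .tmp.prm file per .sph file
ls data/prm/*.tmp.prm | wc -l
ls data/sph/*.sph | wc -l
----------------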
### Step 2 : Front-end processing ####
Depending on the feature format used in the previous step (SPRO or HTK), you'll have to run either 02a_RUN_spro_front-end.sh or 02b_RUN_htk_front-end.sh (see the example after the list below).
This will accomplish three things:
- Energy normalization : this generates a .enr.tmp file for each .tmp.prm file where the energy is normalized.
- Energy detection : in this step, the distribution of energy in each utterance is used to detect the speech intervals (voice activity detection).
- Feature normalization : in this phase, "NormFeat" uses the .tmp.prm files (non-normalized MFCC features) and the corresponding .lbl files (speech zones) to perform mean-variance normalization (MVN). The output is a .norm.prm file for each utterance.
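For example (run only the script matching whatever you used in step 1):
----------------
./02a_RUN_spro_front-end.sh     # if the features were extracted with SPRO
# ./02b_RUN_htk_front-end.sh    # if the features were extracted with HTK
# each utterance should now have a normalized feature file
ls data/prm/*.norm.prm | head
----------------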
### Step 3 : Training a UBM and TV matrix ####
In this step, the universal background model (UBM) and the total variability (TV) matrix are trained using the normalized MFCC features of the training data. Depending on the size of your training set, the number of Gaussians in the GMM and the size of the TV matrix, this step can take two or three days (even with the multithreaded version). The number of threads can be specified in the config files TrainWorld.cfg and TotalVariability.cfg using the numThread option.
Execution :
./03_RUN_i-vector_background.sh
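Before launching it, it's worth checking where numThread is set (the cfg/ directory name is my assumption; the config files may live elsewhere in your copy of the tutorial):
----------------
# locate the numThread option and set it to the number of cores you want to use
grep -n numThread cfg/TrainWorld.cfg cfg/TotalVariability.cfg
----------------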
### Step 4 : I-vector processing / scoring ####
Different i-vector normalization techniques (WCCN, EFR, SphNorm) can be found in 04_RUN_i-vector_processing.sh and 05_RUN_i-vector_step-by-step.sh, along with two-covariance and PLDA scoring. You'll have to define the speaker models, the corresponding target i-vectors and the set of trials you want to test (take a look at ivTest_plda_target-seg.ndx and trainModel.ndx, as shown below).
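The safest way to build these lists for your own data is to mimic the example files shipped with the tutorial (the ndx/ prefix below matches where NIST_files.ndx lives; your copy may differ):
----------------
# inspect the example training/trial lists and reproduce the same format for your data
head ndx/trainModel.ndx
head ndx/ivTest_plda_target-seg.ndx
----------------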
Once the scores are generated, you can use the BOSARIS toolkit (https://sites.google.com/site/bosaristoolkit/) or NIST SRE tools (https://sites.google.com/site/sretools/) to compute equal error rates and plot DET curves.
I hope this helps!
----
References:
[1] Bimbot, Frédéric, et al. "A tutorial on text-independent speaker verification." EURASIP Journal on Advances in Signal Processing 2004.4 (2004): 1-22.
[2] Hansen, John H. L., and Taufiq Hasan. "Speaker recognition by machines and humans: A tutorial review." IEEE Signal Processing Magazine 32.6 (2015): 74-99.
(1) In the file "ivTest_WCCN_Cosine.cfg" used for cosine distance scoring, there is a "backgroundNdxFilename" option, which is set to "plda.ndx" in the example.
Can someone explain how the plda.ndx file is generated?
Do I need to generate another plda.ndx file for my own data?
(2) In "TrainWorld.cfg" there is an "inputFeatureFilename" option, which is set to "UBM.lst". The UBM.lst file contains the speaker names of the example.
Do I need to use my own data names in the UBM.lst file?