Josh's Speaker ID Challenge
NB - This post is not written for a general audience in its current form. The target audience is the organizers of the speakerID challenge in which I participated.
Introduction
Code may be found here.
The challenge was set up as follows:
Given a training set of audio (from now on, `train`) and a set of development data (i.e. `dev`), create and assess a speaker identification system which can assign a speaker label (`spkID`) to a previously unheard test utterance (`utt`).
This document describes both the usage and architecture of the created system.
Dependencies
The following assumes a standard Linux operating system with `python2`, as well as `bash` and `perl` installations at `/bin`.
The system developed here relies on two main (non-standard) dependencies:
- the Kaldi ASR Toolkit
- the `sox` sound manipulation program
For Kaldi installation instructions, follow this post: How to install Kaldi.
For sox installation, simply:
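On a Debian-based system, for example:

```bash
# Debian/Ubuntu; use your distribution's package manager otherwise.
sudo apt-get install sox
```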
Kaldi is used to do almost all of the training and testing. `sox` is used to corrupt the original input data so that it better matches the corrupted testing data.
Setup
Data
The training and dev data are set up in the following way. The topmost directory, `audio_dir`, can be located anywhere and the scripts will still work, provided the correct absolute path is given.
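For illustration, a layout along these lines is assumed (the file names are hypothetical):

```
audio_dir/
├── train/     # 16kHz FLAC training utterances, labeled by speaker
│   └── ...
└── dev/       # development utterances for assessment
    └── ...
```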
Code
After Kaldi and sox have been installed, extract the provided scripts and copy them to the following location:
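That location is the SRE10 recipe directory inside your Kaldi checkout; here `$KALDI_ROOT` is a placeholder for wherever you installed Kaldi:

```bash
# $KALDI_ROOT is a placeholder; adjust to your Kaldi installation.
cp ./*.sh $KALDI_ROOT/egs/sre10/v1/
```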
Specifically, here are the scripts provided to you:
Just take those scripts and move them to `v1/`.
Usage
For training and testing there are two main run scripts:
- `run_challenge_train.sh`
- `run_challenge_test.sh`
Training
The training script should be run from the `sre10/v1` directory:
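The invocation presumably looks like this, with a placeholder path pointing at the data directory described next:

```bash
./run_challenge_train.sh /absolute/path/to/audio_dir
```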
The training script expects a directory with two sub-directories: `train/` and `dev/`. The name `audio_dir` is arbitrary, but `train` and `dev` must be named as such.
Testing
Assuming that you have successfully trained a speakerID system via `./run_challenge_train.sh`, you can now label any 8kHz audio file with a `spkID` label from `train`.
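Presumably along these lines (the audio path is a placeholder):

```bash
./run_challenge_test.sh /absolute/path/to/new-utterance.wav
```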
This script will print out the assigned label to the terminal.
Training Pipeline
The training procedure for creating the speakerID system is the following:
Back Up Original Data, Convert, and Downsample
The original `train` data came in 16kHz FLAC files, so to match the testing (`dev` and `eval`) data, we need to convert to WAV and downsample. We can accomplish this with the following code snippet, making sure we first back up the original FLAC data to `train-org`.
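A minimal sketch of such a snippet, assuming the FLAC files sit directly under `train/`:

```bash
# Back up the original 16kHz FLAC data first.
mv train train-org
mkdir train

# Convert each FLAC file to an 8kHz WAV with sox.
for flac in train-org/*.flac; do
    base=$(basename "$flac" .flac)
    sox "$flac" -r 8000 "train/${base}.wav"
done
```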
Corrupt Data
To create `train` data more similar to `eval` data, the following subscript will perform a `lowpass` filter, add Gaussian `noise`, and `amplify` all existing frequencies:
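A sketch of such a corruption step; the effect parameters (cutoff, noise level, gain) are illustrative, and sox's `whitenoise` generator stands in for the Gaussian noise:

```bash
mkdir -p train-corrupted
for wav in train/*.wav; do
    base=$(basename "$wav" .wav)
    # Synthesize noise matching the utterance length.
    dur=$(soxi -D "$wav")
    sox -n -r 8000 noise.wav synth "$dur" whitenoise vol 0.02
    # Mix in the noise, lowpass filter, and amplify.
    sox -m "$wav" noise.wav "train-corrupted/${base}.wav" lowpass 3400 gain 6
done
rm -f noise.wav
```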
Prep Data and Extract Features
To get the data in the right format for training and to extract audio features, we do the following for both `train` and `dev` data:
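A sketch of the standard Kaldi feature pipeline from the `sre10/v1` recipe, assuming Kaldi data directories (`wav.scp`, `utt2spk`, `spk2utt`) have already been created under `data/`; job counts and paths are illustrative:

```bash
mfccdir=mfcc
vaddir=mfcc

for name in train dev; do
    # Compute MFCCs, then energy-based voice activity detection (VAD).
    steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 8 \
        data/$name exp/make_mfcc/$name $mfccdir
    sid/compute_vad_decision.sh --nj 8 \
        data/$name exp/make_vad/$name $vaddir
    utils/fix_data_dir.sh data/$name
done
```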
Train UBM and T-matrix
The main training sub-script is shown below. This script takes in the `train` data and builds an ivector extractor based on a Universal Background Model (UBM).
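A sketch based on the standard `sre10/v1` recipe, using the component counts of the final system described in the conclusion (200-component full-covariance UBM, 100-dimensional ivectors):

```bash
# Train a diagonal UBM, grow it to a full-covariance UBM, then train
# the ivector extractor (T-matrix) on top of it.
sid/train_diag_ubm.sh --nj 8 data/train 200 exp/diag_ubm_200
sid/train_full_ubm.sh --nj 8 data/train exp/diag_ubm_200 exp/full_ubm_200
sid/train_ivector_extractor.sh --ivector-dim 100 \
    exp/full_ubm_200/final.ubm data/train exp/extractor
```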
Extract ivectors for Assessment
To assess the performance of the system, we extract ivectors from both `train` and `dev`, and then later classify the `dev` utterances using `spkID`s from `train`.
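A sketch, continuing the illustrative paths from above:

```bash
# Extract ivectors for both sets with the trained extractor.
for name in train dev; do
    sid/extract_ivectors.sh --nj 8 \
        exp/extractor data/$name exp/ivectors_$name
done
```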
Train PLDA on ivectors
Next, to project our ivectors into a new space with better discriminative power, we use PLDA as follows:
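A sketch using Kaldi's `ivector-compute-plda`, with the `train` ivectors length-normalized first (paths are illustrative):

```bash
# Train a PLDA model on the speaker-labeled train ivectors.
ivector-compute-plda ark:data/train/spk2utt \
    "ark:ivector-normalize-length scp:exp/ivectors_train/ivector.scp ark:- |" \
    exp/ivectors_train/plda
```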
Assess Classification Error on dev
Next, we take the ivectors we extracted from `dev`, along with the averaged ivectors for each `spkID` in `train`, and compare.
Here we are using a `trials` file in which every `utt` is compared against every `spkID`.
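A sketch of the comparison with `ivector-plda-scoring`; the speaker-level ivectors are the per-`spkID` averages computed by `ivector-mean` (paths are illustrative):

```bash
# Each line of 'trials' is "<spkID> <utt>"; 'scores' gets one PLDA
# similarity score per trial.
ivector-plda-scoring \
    "ivector-copy-plda --smoothing=0.0 exp/ivectors_train/plda - |" \
    "ark:ivector-mean ark:data/train/spk2utt scp:exp/ivectors_train/ivector.scp ark:- |" \
    "ark:ivector-normalize-length scp:exp/ivectors_dev/ivector.scp ark:- |" \
    trials scores
```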
Then we calculate the EER with the following `python` script and Kaldi program:
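A sketch using the `sre10/v1` helper `local/prepare_for_eer.py`, which joins each score with its target/nontarget label, and Kaldi's `compute-eer`:

```bash
# compute-eer reads "<score> <target|nontarget>" lines.
compute-eer <(python local/prepare_for_eer.py trials scores)
```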
Testing Pipeline
There is much overlap between the Training and Testing scripts, so here I will just highlight the differences.
The main difference is that in `run_challenge_test.sh`, we assume the existence of a trained ivector extractor as well as a trained PLDA matrix. These two components must also be located in the same place as if they had just been trained by the above script, `run_challenge_train.sh`.
Feature Extraction
As above, we extract MFCCs and VADs for the new audio file.
ivector Extraction
As above, we extract ivectors from the new audio using the trained `T-matrix`.
Compare new `utt` ivector with `spkID` ivectors via PLDA
This step assumes old speaker ivectors as well as a PLDA matrix.
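The scoring call itself mirrors the assessment step from training; the only new piece is a `trials` file pairing every known speaker with the single new utterance. A minimal sketch (the utterance ID is a placeholder):

```bash
# Pair every train spkID with the one new utt, then score with
# ivector-plda-scoring exactly as in the assessment step above.
awk '{print $1, "new-utt"}' data/train/spk2utt > trials_test
```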
Choose best label
This is specific to testing (i.e. it doesn't happen in training).
The way we use PLDA to compare ivectors traditionally results in a similarity score for each pair of `utt` and `spkID`. However, for our current purposes, we want to find the best `spkID` for a given `utt`. The following simple script does just that:
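A minimal sketch, assuming each line of the scores file has the form `<spkID> <utt> <score>`:

```bash
# Sort by score (general-numeric handles scientific notation) and
# print the spkID with the highest PLDA similarity.
sort -g -k3,3 scores_test | tail -n 1 | awk '{print $1}'
```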
At the end, you should have a single `spkID` printed to your terminal!
Conclusion
The best system I trained obtained an EER of 7.5%, using only the data provided. This system used a full-covariance UBM with 200 components and a 200x100-dimensional T-matrix (i.e. 100-dimensional ivectors).
Hopefully these instructions and explanations were clear, but if there's a need for clarification, don't hesitate to contact me.
Enjoy your new speaker identification system :)