👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since Kaldi. Before devoting weeks of your time to deploying Kaldi, take a look at 🐸 Coqui Speech-to-Text. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s open source on GitHub. I’m on the Coqui founding team so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with Kaldi, so I understand the headaches.

With 🐸 STT, we’ve removed the headaches of Kaldi and streamlined everything for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the 🐸 Model Zoo for open, pre-trained models in different languages. Try it out for yourself, and come join our friendly chatroom 💚

If you want to take a step back and learn about Kaldi in general, I have posts on how to install Kaldi or some miscellaneous Kaldi notes which contain some documentation.

All the scripts presented in this post or required for training can be downloaded here or cloned from here:

Much of the information in this post comes directly from the official nnet2 documentation.

If there are any mistakes or misrepresentations in this post, they are more than likely my fault and not of the Kaldi team. If you find something off, or have questions, please leave a comment!

## Introduction

If you’re reading this, you’ve probably already trained a standard GMM-HMM acoustic model.

If not, you should do that first, because the standard Kaldi scripts for DNN training assume you have trained a GMM-HMM and generated alignments for your training audio.

You should know right off the bat that training a DNN acoustic model does not start the same way as training a GMM acoustic model.

In the simplest case with GMM-HMM training, we train a monophone model from a flat start (i.e. from no previous phoneme-to-audio alignments). That means we can begin training a new model with just utterance-level transcriptions.

In DNN training, on the other hand, we typically start not from utterance-level transcriptions but from labeled frames (phoneme-to-audio alignments) generated by a GMM-HMM system. This means your DNN will be greatly affected by the quality of the GMM-HMM you previously trained. A bad GMM-HMM will give you bad alignments, and if you train a DNN with bad alignments, you will have a bad DNN at the end of the day, no matter how many epochs you run, what kind of cost function you use, or how clever your learning rate is.

A neural net is a tool for classification. We want to be able to take some new features (i.e. audio features) and assign a class to those features (i.e. a phoneme label). So, our acoustic model DNN will have input nodes which correspond to the dimensions of our audio features (think, for example, 39 input nodes for 39 MFCCs), and output nodes which correspond to senone labels (think 900 output nodes for 900 decision tree leaves, i.e. tied context-dependent triphone states). As you can see, two key parts of the DNN acoustic model (input layer and output layer) are modeled after the features used to train a GMM-HMM and the decision tree created for that GMM-HMM.

The dimensions of the hidden layers, on the other hand, are not constrained by the architecture of the previous GMM-HMM or the dimensionality of the audio features. Therefore, the number and size of the hidden layers are up to the researcher / developer.

Once we have a DNN which has the correct number of dimensions for input and output nodes, we can take our phoneme-to-audio alignments and train our neural net.

The audio feature frames are fed into the input layer, and the net assigns a phoneme label to each frame. Since we already have the gold-standard label (i.e. the phoneme label from our GMM-HMM alignments) for any given frame, we can compare what the neural net predicted with what the real phoneme was. Using some loss function and backpropagation, we iterate over all of our training frames to adjust the weights and biases of our net.
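To make the predict, compare, and update loop above concrete, here is a minimal sketch in plain Python. It is not Kaldi code: it assumes a toy classifier with a single softmax layer (no hidden layers), a cross-entropy loss, and made-up two-dimensional "frames" with made-up labels. The names `softmax` and `sgd_step` are mine, chosen for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(weights, biases, frame, label, lr):
    """One cross-entropy SGD update on a single labeled frame.

    Updates weights and biases in place and returns the loss
    measured before the update.
    """
    scores = [sum(w * x for w, x in zip(row, frame)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    loss = -math.log(probs[label])
    for k, row in enumerate(weights):
        # gradient of the loss w.r.t. score k is (prediction - target)
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j, x in enumerate(frame):
            row[j] -= lr * grad * x
        biases[k] -= lr * grad
    return loss

# Toy "alignments": (feature frame, gold label) pairs. Purely illustrative.
frames = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights = [[0.0, 0.0], [0.0, 0.0]]
biases = [0.0, 0.0]

losses = []
for epoch in range(20):
    epoch_loss = sum(sgd_step(weights, biases, f, y, lr=0.5) for f, y in frames)
    losses.append(epoch_loss)

print(losses[0], losses[-1])
```

The loss on the last pass should be lower than on the first, which is all "training" means here. A real acoustic model adds hidden layers and minibatches, but the per-frame comparison against the GMM-HMM label is the same idea.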

Note that, unlike GMM-HMM training, where the EM algorithm iteratively realigns our transcriptions to our audio frames, DNN training doesn’t require realignment. We can realign, but it’s not necessary.

At the end, we hopefully end up with a DNN which will assign the correct phoneme label to a new, unknown audio frame.

## Training a Deep Neural Net

### Introduction

When learning a new method, I try to begin with the most basic, well-documented code I can find.

So, when approaching Kaldi’s DNN code, I chose to start with the nnet2 code, because even though nnet3 is newer, the scripts for nnet2 have been used more, reviewed more, and there’s more documentation.

After settling on nnet2, I looked for the simplest run script I could find. I chose the run_nnet2_baseline.sh script from the Wall Street Journal egs directory. This script is located at kaldi/egs/wsj/s5/local/online/run_nnet2_baseline.sh. This is, as far as I can gather, the simplest official DNN training script for Kaldi at the present moment.

Going off this script, I threw away as many “extra” features and steps as possible. A lot of what I threw away is very important practically speaking, but I want to show the bare minimum of what you need to do to train and test a DNN.

Among other things, I threw away:

1. CMVN adaptation of raw (MFCC/PLP) features
2. pnorm non-linearities
3. online preconditioning of weights and biases
4. all training diagnostics (no more validation examples)
5. final model combination
6. weighting of posteriors of silence phones
7. “mixing up” the number of nodes before the Softmax layer

### First Things First: train a GMM system and generate alignments

I’m not going to go into detail on how to train a GMM system in this post.

However, before you can start training your DNN, you will need the following directories, all generated as part of normal GMM-HMM training in Kaldi:

1. a training data dir (as generated by a prepare_data.sh script in a s5/local directory)
2. a language dir (which has information on your phones, decision tree, etc, probably generated by prepare_lang.sh)
3. an alignment dir (generated by something like align_si.sh).
4. a feature dir (for example MFCCs; made by the make_mfcc.sh script)

Here are the specific files you need from these four dirs:

Because I like visuals and beating dead horses, here’s the tree structure for the dirs and files you need from your GMM:

For the data dir:

For the lang dir:

For the align dir:

For the mfcc dir:

### The Main Run Script: run_nnet2_simple.sh

I’ve adjusted this main run script (as well as its dependencies) to work for my own data set. Originally, I took run_nnet2_baseline.sh and simplified it as much as possible, changed the normal pnorm nonlinearity to the more basic tanh, and removed online preconditioning (which is used to find better learning rates).

In general, I’ve deleted as much as I could of the original script along the way, trying to get to the bare bones of the standard Kaldi neural net training pipeline. The original authors of the scripts had added in lots of good, practical features for training and testing, but as of right now, I’m only concerned with the core of the training procedure. Keep this in mind, because your results will be far from state-of-the-art.

As you may have noticed in other run.sh scripts, this script mainly defines variables and then sends off the heavy lifting of data preparation, training, and testing to other, more specialized scripts.

With this in mind, here’s the beginning of my modified run_nnet2_simple.sh script, which defines some variables:

Next, now that we’ve set our general variables, we have the main code to call the training script. In the original run_nnet2_baseline.sh, the training script called here was train_pnorm_simple.sh. I’ve modified this training script heavily, and most importantly I replaced the pnorm non-linearity with the simpler tanh function, and I also removed preconditioning of the weights and biases. As such, I’ve changed the name from train_pnorm_simple.sh to train_simple.sh.

So, our main run script will call our main train script as such:

As you can see, the only obligatory arguments for train_simple.sh are:

1. the training data
2. the language dir
3. our alignments from our previous GMM-HMM model
4. the name of the dir where we will save our new DNN model

So far this is pretty straightforward: you feed in the right data to the script and you get back a nice shiny DNN. At this point, you could keep this train_simple.sh script a black box, and not worry about what’s going on inside, or you could dive into it. I’m going to take the time to go through it below for anyone who’s interested.

However, for now I’ll finish up presenting the last section of the main run script: testing.

Here’s what our testing section looks like:

It’s as simple as that! You decode the test data using your freshly trained DNN, and you use best_wer.sh to give you a score at the very end for your word error rate.
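If you haven’t computed a word error rate by hand before, the metric itself is simple: the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below shows only the metric. It is not what best_wer.sh runs internally, and the function name `word_error_rate` and the example sentences are mine. It assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion (the second "the")
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {100 * wer:.2f}%")
```

Here two errors against six reference words give a WER of about 33%.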

The main decode script, decode_simple.sh, takes in six arguments. The two *_phone arguments are not usually present in the standard Kaldi scripts, but I added them to make sure I knew where my unknown and silence phones were being used and defined. Feel free to take them out and deduce them from something like phones.txt.

In any case, here are the six arguments to decode_simple.sh:

1. the original decoding graph from your GMM-HMM
2. dir for your test data
3. the final, trained DNN acoustic model
4. the “unknown” phone (e.g. UNK)
5. the “silence” phone (e.g. SIL)
6. new dir to save decoding information in (lattices, etc)

If you’ve made it up to this point, you should have hopefully been able to successfully train and test a deep neural net acoustic model.

It probably didn’t get you a great word error rate, but now that you have something working you can go and tweak parameters, add in more sophisticated non-linearities, try out different posterior weighting, or some CMVN or other speaker adaptive transformations.

For those wanting to get more into the training script, I’m now going to walk through the main training script called by the main run script.

### The Main Training Script: train_simple.sh

First, we have a bunch of default parameters to set:

After parsing the command-line arguments, we check to make sure our important files are where they should be. All these files should have been generated during our GMM-HMM training.

Once we’re sure all the important files are in place, we extract some variables from the information in those files:

At this point in the script, we have defined a bunch of variables, created two files, (1) tree (copied from the GMM-HMM) and (2) num_jobs, and created an empty log directory. We can see these new additions in our main experiment dir:

Now we move on to data preparation before training, and we start with estimating the LDA feature transform via get_lda_simple.sh. This transformation matrix will be applied to our spliced features before we take them in as input to our DNN.

This script outputs a matrix for the LDA transform, and this same matrix will show up again as a FixedAffineComponent when we initialize the neural net, just after our input layer with splicing. That means, once we’ve got our LDA transform, it will be applied to all input, and because it is a FixedComponent, the matrix will not be updated by back-propagation.
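Here is a rough picture of what "splice, then apply a fixed transform" means, in plain Python rather than Kaldi. The assumptions are mine: a context of one frame on each side, edge frames repeated to pad the window, and a made-up 2x6 matrix and offset standing in for the real LDA transform, which get_lda_simple.sh estimates from data.

```python
def splice(frames, left, right):
    """Concatenate each frame with its left/right neighbours.

    At the edges, the first or last frame is repeated to fill the window.
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to a valid frame
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

def apply_fixed_affine(matrix, offset, vector):
    """Compute matrix * vector + offset. Nothing here is ever trained."""
    return [sum(m * v for m, v in zip(row, vector)) + b
            for row, b in zip(matrix, offset)]

# Three toy 2-dimensional feature frames
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
spliced = splice(frames, left=1, right=1)  # each row is now 6-dimensional

# Hypothetical transform reducing 6 dimensions to 2
matrix = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
offset = [0.5, -0.5]
projected = [apply_fixed_affine(matrix, offset, v) for v in spliced]

print(spliced[1])    # the middle frame with its two neighbours
print(projected[1])
```

The point of the fixed component is visible in the code: `matrix` and `offset` are read but never written, so backpropagation has nothing to update there.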

Here are the new files created by get_lda_simple.sh.

Now that we’ve estimated the LDA transform, we will format our training data. In the original get_egs.sh script, we split the training data into training and validation, and the validation is used for diagnostics during the training iterations.

However, for my modified get_egs_simple.sh I took out validation and diagnostics altogether. As such, my version merely takes the training data and formats it without chopping it up into various subsets for diagnostics.

After we run the get_egs_simple.sh script, we find that we have generated a new egs dir.

Now that we have the training examples (phone-to-frame alignments) sorted and in the correct format, we go on to initialize our neural net.

Similar to our topo configuration file we used in GMM-HMM training, we have to define the dimensions and architecture of our neural net before we can initialize it. These definitions are stored in a configuration file called nnet.config, and it gets defined and saved to disk as a normal part of the train_simple.sh script.

The below descriptions are taken almost verbatim from the official Kaldi nnet2 docs.

1. SpliceComponent defines the size of the window of feature-frame-splicing to perform.
2. FixedAffineComponent is our LDA-like transform created by get_lda_simple.sh.
3. AffineComponent is the standard Wx+b affine transform found in neural nets. This first AffineComponent represents the weights and biases between the input layer and the first hidden layer.
4. TanhComponent is the standard tanh nonlinearity.
5. AffineComponent is the standard Wx+b affine transform found in neural nets. This second AffineComponent represents the weights and biases between the hidden layer and the output layer.
6. SoftmaxComponent is the final nonlinearity that produces properly normalized probabilities at the output.

This initial DNN configuration contains one and only one hidden layer.

That is, there are six Kaldi components, but only three layers in the net.

It follows that there are only two updatable weight matrices and two updatable bias vectors. If we look back at the nnet.config file definition, there are indeed only two updatable components, and both are of type AffineComponent.
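A quick back-of-the-envelope count shows how few trainable parameters that is. The dimensions below are hypothetical (a 40-dimensional output from the fixed transform, 100 hidden units, 900 decision tree leaves), so substitute the numbers from your own nnet.config.

```python
def affine_param_count(input_dim, output_dim):
    """Parameters in y = Wx + b: one weight matrix plus one bias vector."""
    return output_dim * input_dim + output_dim

# Hypothetical dimensions, for illustration only
lda_dim = 40       # output dim of the fixed transform
hidden_dim = 100   # size of the single hidden layer
num_leaves = 900   # output classes (decision tree leaves)

first_affine = affine_param_count(lda_dim, hidden_dim)
second_affine = affine_param_count(hidden_dim, num_leaves)
total_updatable = first_affine + second_affine

print(first_affine, second_affine, total_updatable)
```

The splice, fixed affine, tanh, and softmax components contribute nothing to this count, because none of them has parameters that backpropagation updates.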

Next in the train_simple.sh script, we create another, similar configuration file which defines the structure of the hidden layers we will be adding to our net during training:

Again, we find here an affine transform followed by a non-linearity. Nothing too crazy.
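To picture how the net grows, you can treat it as an ordered list of components. The sketch below assumes that a new hidden layer is spliced in just before the final AffineComponent and SoftmaxComponent. That placement is my assumption for illustration, and `add_hidden_layer` is a made-up helper, not a Kaldi program.

```python
def add_hidden_layer(components, hidden_layer):
    """Return a new component list with hidden_layer placed just before
    the final two components (output affine + softmax)."""
    return components[:-2] + hidden_layer + components[-2:]

# The six components of the initial net, in order
initial_net = [
    "SpliceComponent",
    "FixedAffineComponent",
    "AffineComponent",
    "TanhComponent",
    "AffineComponent",
    "SoftmaxComponent",
]

# One extra hidden layer: an affine transform followed by a non-linearity
hidden_layer = ["AffineComponent", "TanhComponent"]

bigger_net = add_hidden_layer(initial_net, hidden_layer)
for name in bigger_net:
    print(name)
```

After one insertion the net has eight components, three of which are updatable AffineComponents, so it now has two hidden layers instead of one.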

Now, we can put our decision tree, HMM topology file, and nnet.config file to work and initialize our first neural net, aka 0.mdl:

Let’s do another “check-in” to see what files we have created so far. We find that we’ve only created four new files:

Also, we can take a look at our untrained model and get some info about it by making use of the nifty nnet-am-info.cc program:

Now that we have an initialized model and labeled training examples to go along with it, we can train the transitions of the HMMs in our DNN-HMM acoustic model. In GMM-HMM training the transitions get updated during EM training, but since we are not doing any realignment for DNN training, the initial transition probabilities will do just fine.

Since we are seeding with the original 0.mdl and resaving it as the same name 0.mdl, the only file produced by this command is the log file train_trans.log.

Taking another quote from the official nnet2 docs with regards to nnet-train-transitions.cc:

This computes the transition probabilities that will be used in the HMMs in decoding (which has nothing to do with the neural net itself), and also computes the prior probabilities of the “targets” (the several thousand context-dependent states). Later, when we do decoding, we will divide the posteriors computed by the network by these priors to get “pseudo-likelihoods”; these are more compatible with the HMM framework than raw posteriors.
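The "divide the posteriors by the priors" step in that quote is easy to see with toy numbers. This is a sketch under my own assumptions, not Kaldi's implementation: priors are estimated as relative frequencies of the frame labels in the alignments, every class is assumed to occur at least once (otherwise the log would fail), and the division is done in the log domain. The function names are made up.

```python
import math
from collections import Counter

def estimate_priors(frame_labels, num_classes):
    """Relative frequency of each class among the aligned training frames."""
    counts = Counter(frame_labels)
    total = len(frame_labels)
    return [counts[k] / total for k in range(num_classes)]

def pseudo_log_likelihoods(posteriors, priors):
    """log(posterior / prior) for each class, computed in the log domain."""
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]

# Toy alignment: class 0 (think: silence) dominates the training frames
frame_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
priors = estimate_priors(frame_labels, num_classes=3)

# Toy network output for one new frame
posteriors = [0.7, 0.2, 0.1]
scores = pseudo_log_likelihoods(posteriors, priors)
best = max(range(len(scores)), key=scores.__getitem__)

print(priors)
print(scores)
print(best)
```

Class 0 has the highest raw posterior, but it is also very frequent in training, so once the prior is divided out class 1 gets the best score. That rescaling is what makes the network outputs behave like the likelihoods an HMM decoder expects.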

These priors are stored back in the model, and we can see that they exist by again getting info about our neural net with nnet-am-info.cc

Now we move on to the main training loop, which will update our parameters via backpropagation.

The main program doing the training in this loop is nnet-train-parallel.cc.

Directly quoting from the comments of the source code:

Train the neural network parameters with backprop and stochastic gradient descent using minibatches. As nnet-train-simple, but uses multiple threads in a Hogwild type of update (for CPU, not GPU).

So, using this parallelized training routine, we will in fact train multiple DNNs on each iteration. You can see this in the log files, which are named train.$x.JOB.log, where $x is the iteration number and JOB is the job number. In my case, since I have only four processors on my laptop, I’m running four jobs per iteration. That means that on each iteration I have to either merge these four nets in a smart way, or just choose the best one.

The original train_pnorm_simple.sh script is set up to either average the models or choose the best one. The rationale behind this is that on certain iterations the model will be unstable, and it’s better to just choose the best one instead of averaging. The unstable iterations are either (a) the very first iteration or (b) an iteration where we’ve just added a new hidden layer.

In the comments of the original script, the authors note:

On iteration zero or when we just added a layer, use a smaller minibatch size and just one job: the model-averaging doesn’t seem to be helpful when the model is changing too fast (i.e. it worsens the objective function), and the smaller minibatch size will help to keep the update stable.

I’ve removed the “choose the one-best job” option from train_simple.sh. This will surely lead to instability, but it simplifies the training process and makes the big picture easier to see.
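With the one-best option gone, what remains at the end of each iteration is plain averaging of the nets produced by the parallel jobs. Conceptually that is just an element-wise mean over the parameters. The sketch below flattens each job's parameters into a list of floats. The values are made up, and `average_models` is my own name, not a Kaldi program.

```python
def average_models(models):
    """Element-wise mean of several parameter vectors, one per parallel job."""
    n = len(models)
    return [sum(values) / n for values in zip(*models)]

# Flattened parameters from four jobs after one iteration (toy values)
job_params = [
    [0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
]

averaged = average_models(job_params)
print(averaged)
```

The averaged vector then becomes the starting point for every job on the next iteration. This is also where the instability comes from: if the jobs have drifted far apart, their mean may be worse than any single one of them.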

Furthermore, in the original script, we would normally have the option of mixing up the number of components in the neural net. However, I chose to take out the mixing up option for the sake of making as minimal a net as possible.

So that’s how train_simple.sh works… I hope you found this walkthrough useful.

## Relevant Papers

### Povey et al. (2015) Parallel Training of DNNs with Natural Gradient and Parameter Averaging

According to the official docs, this is the best published description of nnet2 (Dan’s DNN setup):

We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.

### Rath et al. (2013) Improved feature processing for Deep Neural Networks

In this paper, we investigate alternative ways of processing MFCC-based features to use as the input to Deep Neural Networks (DNNs). Our baseline is a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT. Confirming the results of other groups, we show that speaker adaptation applied on the top of these features using feature-space MLLR is helpful. The fact that the number of parameters of a DNN is not strongly sensitive to the input feature dimension (unlike GMM-based systems) motivated us to investigate ways to increase the dimension of the features. In this paper, we investigate several approaches to derive higher-dimensional features and verify their performance with DNN. Our best result is obtained from splicing our baseline 40-dimensional speaker adapted features again across 9 frames, followed by reducing the dimension to 200 or 300 using another LDA. Our final result is about 3% absolute better than our best GMM system, which is a discriminatively trained model.

Dan Povey is one of the co-authors, and there’s a lot of good, concise information here which directly relates to Kaldi.

### Maas et al. (2013) Building DNN Acoustic Models for Large Vocabulary Speech Recognition

From the abstract:

Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.

Co-authors include Dan Jurafsky and Andrew Ng, among others. This is a longer paper (22 pages) and is very thorough.