👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since Kaldi. Before devoting weeks of your time to deploying Kaldi, take a look at 🐸 [Coqui Speech-to-Text][coqui-github]. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s [open source on GitHub][coqui-github]. I’m on the Coqui founding team, so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with Kaldi, so I understand the headaches.
With 🐸 STT, we’ve removed the headaches of Kaldi and streamlined everything for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the [🐸 Model Zoo][coqui-model-zoo] for open, pre-trained models in different languages. Try it out for yourself, and come join our [friendly chatroom][coqui-gitter] 💚
## Introduction
In an attempt to be more systematic about tuning my hyperparameters for an nnet3 model, I’ve decided to keep this post as a kind of collection of running notes.
## DNN Architecture
Looking through the official example scripts (found in `KALDI/egs/`), here’s what I find for typical architectures:
### Switchboard Corpus

### Wall Street Journal

### Tedlium
Some scripts use the `steps/nnet3/tdnn/make_configs.py` script to automatically generate the config file, and they look something like this:
### Librispeech

### Fisher Switchboard
## Learning Rates
## Number of Training Epochs
## Number of Jobs Initial / Final
### TL;DR
- small data ==> small num jobs
- few GPUs ==> small num jobs
### Some insights
Dan Povey’s response to a question about optimizing `num-jobs-initial` and `num-jobs-final`:
> Q: In training nnet2/nnet3 networks, do the `num-jobs-initial`/`num-jobs-final` parameters make any important difference to the result, or is this just an efficiency heuristic to distribute as many jobs as the nnet can deal with (in terms of averaging the parallel jobs’ results, I suppose)?
>
> [ David van Leeuwen ]
> A: They do make a bit of a difference, but in general fewer is better [except if your system has too many parameters and is overtraining excessively, but the best solution for that is to use fewer epochs or fewer parameters.]
>
> … in general it’s best to use no more jobs than the number of GPUs that you have. And when reducing the num-jobs, it might be a good idea to reduce the number of epochs slightly.
>
> [ Dan Povey ]
Another response from Povey on the Kaldi team’s views towards these parameters:
> Q: If I change the values of `num-jobs-initial` and `num-jobs-final`, do I also need to change other parameters, like the learning rate? Doesn’t changing the values of `num-jobs-initial` and `num-jobs-final` affect the accuracy of the training? That’s why I wonder why the author of `run_tdnn.sh` chose the values “2” and “12” for these parameters. It’s the same question when I see other parameters in Kaldi scripts, like `$nj`, set to a precise value: if we change `$nj`, does it affect the final accuracy and the WER?
>
> [ Mannix ]
> A: You don’t have to change the learning rate; those are automatically adjusted in a suitable way. It may affect results, and it may be a good idea to reduce the num-epochs if you reduce the num-jobs, but we normally don’t tune the num-jobs for WER (we choose it based on convenience in training time), so the WER could be better or worse. In general, for non-neural-net scripts and for decoding, the `$nj` option makes no difference to the results.
>
> [ Dan Povey ]
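To make the interplay between num-jobs and the learning rate concrete, here is a rough sketch of the idea in Python. The function names and the exact interpolation are my own simplification, not Kaldi’s actual code: the number of jobs ramps roughly linearly from `num-jobs-initial` to `num-jobs-final` over training, and the learning rate handed to each job is the (decaying) *effective* learning rate multiplied by the current number of jobs, so that averaging the parallel models works out — which is why you don’t have to retune the learning rate yourself.

```python
import math

def num_jobs_at(iter_, num_iters, jobs_initial, jobs_final):
    """Roughly linear ramp of parallel jobs over training (a sketch of the
    behavior behind --num-jobs-initial / --num-jobs-final, not Kaldi's code)."""
    frac = iter_ / max(1, num_iters)
    return int(0.5 + jobs_initial + (jobs_final - jobs_initial) * frac)

def learning_rate_at(iter_, num_iters, lr_initial, lr_final, current_num_jobs):
    """Geometrically decay the *effective* learning rate, then multiply by the
    current number of jobs, so the averaged parallel updates behave sensibly."""
    effective = lr_initial * math.exp(
        iter_ * math.log(lr_final / lr_initial) / num_iters)
    return current_num_jobs * effective

# With the "2" and "12" from run_tdnn.sh, jobs ramp 2 -> 12 across training:
for it in (0, 50, 100):
    print(it, num_jobs_at(it, 100, 2, 12))
```

Under this sketch, halfway through a 100-iteration run you would be using about 7 jobs, and the actual per-job learning rate silently grows with the job count even as the effective rate decays.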
## Nnet3 Egs
How nnet3 examples are saved to disk.
When working with a TDNN or another kind of feed-forward (non-recurrent) neural net, we will find the training examples saved to disk in archive files, in [data, label] pairs.
You can use something like `nnet3-copy-egs ark:egs.1.ark ark,t:- | head` to take a quick look at your data in human-readable format.
Here’s one training example from some librispeech data:
Now let’s break down this big mess.
First we get info about the audio file this example comes from: `libri-org_19-198-0024-1068`
Then we get `<NumIo> 2`, which tells us this example contains two named I/O streams: one for the input features and one for the output labels.
We can find how many frames of audio are in this example: `<I1V> 36`
Then we see a long list of entries like `<I1> 0 1 0`, one tag for each of our 36 frames, where the middle number is the position relative to the central frame.
Then come the audio features, `21.62141 -1.176828 -3.170161 -0.491064 -2.720289 -2.210604 0.3688199 1.155272 -0.03023905 -0.007696561 1.108493 0.08210284 0.2613205`, with one row of features for each frame.
After all the audio features comes info about the labels:
For our 36 audio frames, we have eight labels: `<I1V> 8`. This has to do with splicing.
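One way to make the 36-vs-8 arithmetic concrete: the example stores extra input frames of left and right acoustic context around the 8 frames whose labels it actually predicts. Here is a minimal sketch; the 14/14 context split below is hypothetical, chosen only so the numbers match the dump above.

```python
def input_frames_needed(frames_per_eg, left_context, right_context):
    # The example must hold enough input frames to cover the temporal
    # context the network splices in on both sides of the predicted frames.
    return frames_per_eg + left_context + right_context

# Hypothetical 14/14 split; any left+right summing to 28 fits 36 = 8 + 28.
print(input_frames_needed(8, 14, 14))
```

The actual left and right context widths depend on the TDNN’s layer-wise splicing configuration, so don’t read the 14/14 split as anything but an illustration.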
Next comes, as above, information about the relative location of the labels: `<I1> 0 4 0`
Given that these labels are one-hot vectors, they are represented by just their indices: `dim=856 [ 524 1 ] dim=856 [ 178 1 ]`. The `dim=856` reminds us how many senones are in the model (and hence the dimensionality of the target vectors), and the `[ 524 1 ]` tells us the senone ID and the weight of the posterior. In my case this is trivial, because it translates to “the current frame has the label 524 with 100% certainty.”
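To see what that sparse notation expands to, here is a tiny helper (my own, not a Kaldi API) that turns an entry like `dim=856 [ 524 1 ]` into the dense one-hot target vector it represents:

```python
def to_dense(dim, pairs):
    """Expand Kaldi's sparse posterior notation into a dense vector.
    `pairs` is a list of (senone_id, weight) tuples, e.g. [(524, 1.0)]."""
    vec = [0.0] * dim
    for idx, weight in pairs:
        vec[idx] = weight
    return vec

# dim=856 [ 524 1 ]  ->  an 856-dim vector with a single 1.0 at index 524
label = to_dense(856, [(524, 1.0)])
print(label[524], sum(label))
```

Storing only the (index, weight) pairs is what keeps the egs files small: each 856-dimensional target costs a couple of numbers rather than 856.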
## Other Documentation
Here’s some documentation I wrote up as a walk-through of a typical Kaldi GMM `run.sh` script.
Here are some more general Kaldi notes I’ve put together. They cover topics like HCLG and file formats.