Kaldi Hyperparameter Cheatsheet
👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since Kaldi. Before devoting weeks of your time to deploying Kaldi, take a look at 🐸 Coqui Speech-to-Text. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s open source on GitHub. I’m on the Coqui founding team, so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with Kaldi, so I understand the headaches.
With 🐸 STT, we’ve removed the headaches of Kaldi and streamlined everything for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the 🐸 Model Zoo for open, pre-trained models in different languages. Try it out for yourself, and come join our friendly chatroom 💚
Introduction
The following is a cheatsheet for common hyperparameters in Kaldi.
If you’re looking to get started with Kaldi, here’s an installation guide and DNN training guide. If you’d like a simple, easy to understand Kaldi recipe, you can check out the `easy-kaldi` GitHub repo. You probably won’t get state-of-the-art results with `easy-kaldi`, but you will hopefully be able to understand the pipeline.
If you’re looking for more in-depth troubleshooting, check out Kaldi Troubleshooting from Head-to-Toe.
Hyperparameter Cheatsheet
The following parameter ranges are what I would recommend as a good starting place. However, what works for your data and your application may differ.
GMM-HMM Alignment
Each of the following steps depends on the previous one. If you have bad monophone alignments, you will have bad triphone alignments, and if you have bad triphone alignments, you will train a bad neural net. As such, you should take some time to tweak the parameters at each stage, to make sure your model and alignments are good enough to pass on to the next stage.
The parameters listed here have two values associated with them, N → M. Good model parameters for your data should be somewhere in between the extremes of N and M, so I’d advise some binary search to find good settings for you. Optimize the number of training iterations only after you’ve gotten good numbers for `numleaves` and `totgauss`.
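To make this concrete, here is a minimal sketch of the monophone → triphone stages using the standard `steps/` scripts from a Kaldi recipe directory. The `data/` and `exp/` paths and the exact values (1000 Gaussians for the monophone pass; 2000 leaves and 10000 Gaussians for the triphone pass) are placeholder assumptions for illustration, not recommendations for your data.

```bash
# A sketch of the GMM-HMM alignment pipeline; all paths and values
# are placeholders -- tune numleaves and totgauss for your own data.

# Monophone training, then alignment:
steps/train_mono.sh --nj 8 --cmd run.pl --totgauss 1000 \
  data/train data/lang exp/mono
steps/align_si.sh --nj 8 --cmd run.pl \
  data/train data/lang exp/mono exp/mono_ali

# Triphone training on those alignments; the two positional numbers
# are numleaves (2000) and totgauss (10000):
steps/train_deltas.sh --cmd run.pl 2000 10000 \
  data/train data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 8 --cmd run.pl \
  data/train data/lang exp/tri1 exp/tri1_ali
```

If the triphone alignments look bad, come back and re-tune `numleaves` and `totgauss` here before spending any time on neural-net training.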
DNN Training
You should ideally be using `nnet3` instead of `nnet2`. At this point, `nnet3` is more tried and tested, and it will have better support moving forward.
Long, skinny nets are better than short, fat ones. Monitor your training progress with the information in `compute_prob_train.*.log` and `compute_prob_valid.*.log`.
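The `nnet3` training scripts periodically write a per-frame log-probability for a held-out training subset and for a validation subset into those logs. A minimal sketch for pulling them out, assuming a hypothetical experiment directory `exp/nnet3/tdnn` (adjust the path for your setup):

```bash
# Per-frame log-probabilities from the diagnostic logs; the
# exp/nnet3/tdnn path is an assumption for this example.
grep "Overall" exp/nnet3/tdnn/log/compute_prob_train.*.log
grep "Overall" exp/nnet3/tdnn/log/compute_prob_valid.*.log
```

If the train number keeps improving while the valid number stalls or gets worse, the net is overfitting.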
Decoding
There’s a speed / accuracy trade-off at decoding time. You can decode faster by considering fewer candidate words and phrases, but if the correct word gets pruned from the search, you’ve missed it.
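The main knobs live in the decoding scripts. Here is a minimal sketch, assuming a typical triphone system in `exp/tri3` and a `data/test` set (both placeholders); the option values shown are the usual `steps/decode.sh` defaults:

```bash
# Wider --beam and larger --max-active: slower, but more accurate.
# Narrower --beam and smaller --max-active: faster, but more
# hypotheses get pruned, so the correct word can fall out of the
# search entirely.
steps/decode.sh --nj 8 --cmd run.pl \
  --beam 13.0 --lattice-beam 6.0 --max-active 7000 \
  exp/tri3/graph data/test exp/tri3/decode_test
```

Start from the defaults and loosen the beams only if accuracy matters more than speed for your application.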
Conclusion
If you’re still running into issues, look on kaldi-help to see if someone else has had your problem (often they have). As far as I know, kaldi-help is the only forum for these kinds of questions. Unfortunately, the atmosphere on kaldi-help can be unwelcoming to newcomers. If you post a question and get a critical response, don’t let it upset you; it has nothing to do with you or your ability to do good ASR! I was afraid to post on kaldi-help for a long time because of this atmosphere, and indeed my first post was not received kindly. Alternatively, you can post questions here on my blog, though the community here is not as large as kaldi-help’s.
I hope this was helpful, and happy Kaldi-ing!
If you have comments or suggestions, you can always leave a comment below.