👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since Kaldi. Before devoting weeks of your time to deploying Kaldi, take a look at 🐸 Coqui Speech-to-Text. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s open source on GitHub. I’m on the Coqui founding team, so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with Kaldi, so I understand the headaches.
With 🐸 STT, we’ve removed the headaches of Kaldi and streamlined everything for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the 🐸 Model Zoo for open, pre-trained models in different languages. Try it out for yourself, and come join our friendly chatroom 💚
Installation via GitHub
Kaldi is primarily hosted on GitHub (not SourceForge anymore), so I’m going to just clone the official GitHub repository to my Desktop and go from there.
Taking a look inside to see what I just cloned:
Now, there’s a lot of good official documentation for Kaldi, but I think the best install info will always be in the INSTALL file that ships with the latest version. So, let’s take a look:
First things first, it says to go to tools/ and follow those instructions. So, let’s cd into tools/ and see what’s there:
Looking into the INSTALL file, we see:
So, first we need to check our dependencies:
I’m OK on this one, but I have a feeling others will need to install some dependencies before they move on. I’d recommend re-running that check_dependencies.sh script after you do your installs, to make sure you actually installed what you needed and that it ended up in the right spot.
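The idea behind a check like that is simple: every required tool has to be findable on your PATH. Here’s a minimal sketch of that kind of check (the tool list below is illustrative, not the exact list Kaldi’s script verifies):

```shell
# Minimal sketch of a dependency check: probe PATH for each required
# tool and collect the ones that are missing. (Illustrative tool list,
# not Kaldi's actual one -- defer to check_dependencies.sh.)
missing=""
for tool in g++ make git; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -z "$missing" ]; then
  status="OK"
else
  status="missing:$missing"
fi
echo "$status"
```

If anything turns up missing, install it with your package manager and run the check again.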
Moving along, we need to run make. There’s an option here for parallelizing this step, so I’m going to check how many processors I have:
So I can run make on all 4 of my processors like this:
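Those two steps look like this on my end (nproc is the Linux way to count cores; on macOS the equivalent is sysctl -n hw.ncpu — the make line is commented out below only because the real build takes a while):

```shell
# Count logical CPUs: nproc on Linux, sysctl on macOS, fall back to 1.
NCPU=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
echo "Building with $NCPU parallel jobs"
# Then, from inside kaldi/tools/:
#   make -j "$NCPU"
```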
Those last lines recommend installing the IRSTLM language modeling toolkit, and since I want to make my own language models, I’m going to install it. If you’re using some pre-existing language model, you can skip these next few steps.
Now we should have a working installation of IRSTLM on the computer, and you can verify by looking into /usr/local:
We don’t have to do anything else with IRSTLM right now because we’re just installing. But it’ll be there when you need it!
So, at this point we’ve done part (1) of the kaldi/INSTALL file (i.e. following the steps in the kaldi/tools/INSTALL file).
Now let’s go on to step (2), and follow the instructions in kaldi/src/INSTALL.
Looking into the INSTALL file itself:
Like it says, the first step is to run the ./configure script:
Now we run make depend:
And finally, make:
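Gathering those three steps in one place, here’s the sequence as I’d run it from inside kaldi/src/ (the --shared flag is optional, and I’ve added a guard so the snippet does nothing if you’re not actually in a Kaldi checkout):

```shell
# The three build steps from kaldi/src/INSTALL, in order.
# Guarded so the snippet is harmless outside a Kaldi checkout.
if [ -f configure ] && [ -d ../tools ]; then
  ./configure --shared   # --shared builds shared libraries (optional)
  make depend -j 4       # adjust -j to your core count
  make -j 4
  built=yes
else
  built=no
  echo "Not inside kaldi/src/; skipping build."
fi
```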
If you’ve gotten to this point without any hiccups, you should now have a working installation of Kaldi!
Testing Kaldi Out
The YESNO Example Recipe
To make sure our install worked well, we can take advantage of the examples provided in the kaldi/egs/ directory:
Let’s take a look at the README.txt file:
Since we can try out yesno off the shelf (the WAV files are downloaded when you run the run.sh script), we’re going to go with that one.
Pre-Training File Structure
To get a clearer picture of the file structure, I like to use the tree command, which displays the directory structure as a tree with indented branches. You might have to install tree, but I’d say it’s worth it.
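If you can’t (or don’t want to) install tree, a crude substitute is find piped through sed. Here’s a sketch, run against a tiny made-up directory so it has something to show (the demo_egs fixture is mine, not part of Kaldi):

```shell
# Build a tiny directory fixture, then print it as an indented tree.
# (A rough stand-in for `tree`: each path component becomes two spaces.)
mkdir -p demo_egs/yesno/s5/local
listing=$(find demo_egs -print | sed -e 's|[^/]*/|  |g')
echo "$listing"
```

The indentation is cruder than tree’s, but it gets the hierarchy across.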
These original directories contain general information about the language (in the input/ dir), instructions for preparing the data and scoring it (in the local/ dir), and information about the kind of model we want to train and test (in the conf/ dir).
More big-picture scripts (e.g. training monophones, extracting MFCCs from WAV files, etc.) are in the steps/ and utils/ dirs. Since these scripts generalize across recipes, Kaldi stores a single copy of them in the Wall Street Journal example, and all other example dirs (like yesno) contain symbolic links to those dirs.
Data Prep & Training & Testing: The run.sh Script
Now let’s cd into the s5/ directory (which holds all the relevant scripts and data for running this example) and run the run.sh script.
You can see from the last line of output that, as we were warned in the README, this data set is not interesting because we get perfect performance: our Word Error Rate was indeed 0.00%.
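For reference, Word Error Rate is the number of substitutions, deletions, and insertions divided by the number of words in the reference transcript. With zero errors, that works out exactly as you’d expect; the counts below are illustrative numbers for one perfectly decoded eight-word yesno utterance, not Kaldi’s actual tallies:

```shell
# WER = (substitutions + deletions + insertions) / reference word count.
# Illustrative counts: a perfectly decoded 8-word utterance.
S=0; D=0; I=0; N=8
WER=$(( (S + D + I) * 100 / N ))
echo "WER: ${WER}%"
```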
Post-Training & Testing File Structure
If we take another look at the yesno dir, we will see that our run.sh file generated some more directories and files for us.
I’m going to use the tree command below with the -d flag so we only see directories. Otherwise, all the downloaded WAV files are listed, and it’s a little much.
Walking down the subdirs, we can see that the three original dirs were left unchanged:
These are unchanged because these original directories house general information about the language (in the input/ dir), instructions for preparing the data and scoring it (in the local/ dir), and information about the kind of model we want to train and test (in the conf/ dir).
Logically, nothing about these files and directories should change after we train and test the model.
However, the newly created data/ directory has a lot of new stuff in it. In general, this directory created by the run.sh script houses and organizes the files which describe the language (e.g. dictionary, phone lists, etc) and data (e.g. WAV file ids and their transcripts) to test and train the model.
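To make that concrete, two of the most important files under data/ are wav.scp (mapping each utterance ID to its audio file) and text (mapping each utterance ID to its transcript). Here’s a hand-made sketch of what they look like — the IDs, paths, and transcript are illustrative, not the exact contents run.sh writes:

```shell
# Sketch of the two core data files Kaldi expects in a data/ subdir.
mkdir -p demo_data
# wav.scp: one line per utterance: <utt-id> <path-to-wav>
printf '0_0_1_1_0_0_1_1 waves_yesno/0_0_1_1_0_0_1_1.wav\n' > demo_data/wav.scp
# text: one line per utterance: <utt-id> <transcript>
printf '0_0_1_1_0_0_1_1 NO NO YES YES NO NO YES YES\n' > demo_data/text
cat demo_data/wav.scp demo_data/text
```

Every utterance ID in text has to match one in wav.scp — Kaldi is strict about keeping these files consistent and sorted.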
The next directory created by the run.sh script is the exp/ directory. As far as I can gather, “exp” is short for “experiment”. I think this is the case because the exp/ dir holds information about the model you’re training and testing. It has a lot of files, as you see below, and you can see that a lot of them (if not most) are .log files.
I think that Kaldi could have more transparent naming conventions for files and directories, but I will say that the log files are very thorough. There’s a lot of info to be found if you do some digging.
The last directory created by the run.sh script isn’t super interesting, but it’s essential. This is the mfcc/ dir. This directory holds all the .ark (archive) and .scp (script) files for (1) the MFCC features as well as (2) the cepstral mean and variance statistics per speaker.
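Handily, an .scp file is plain text: each line maps an utterance ID to the byte offset of its data inside an .ark file, so you can inspect it with less or pull it apart in the shell. A sketch, using an illustrative line in the format Kaldi writes (not one actually generated here):

```shell
# Each .scp line has the form: <utt-id> <ark-file>:<byte-offset>
line='0_0_1_1_0_0_1_1 mfcc/raw_mfcc_train_yesno.1.ark:13'
utt=${line%% *}     # everything before the first space: utterance id
loc=${line#* }      # everything after it: archive location
ark=${loc%:*}       # archive path
offset=${loc##*:}   # byte offset of this utterance's features
echo "$utt -> $ark @ byte $offset"
```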
I hope this was helpful!
If you have comments or suggestions, you can always leave a comment below.