CMU-Sphinx Cheatsheet
👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since CMU-Sphinx. Before devoting significant time to deploying CMU-Sphinx, take a look at 🐸 Coqui Speech-to-Text. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s open source on GitHub. I’m on the Coqui founding team so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with speech technologies like CMU-Sphinx, so I understand the headaches.
With 🐸 STT, we’ve removed the headaches and streamlined for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the 🐸 Model Zoo for open, pre-trained models in different languages. Try it out for yourself, and come join our friendly chatroom 💚
Overview
Below are some commands that I’ve found particularly useful in working with CMU-Sphinx from the command line (i.e. Bash) on my Linux machine.
I hope they’re helpful to others, and if you have comments or suggestions for other commands to include, leave a comment! I’d like to get as much on here as possible, but still have it be more of a cheatsheet than a manual.
Acoustic Model Training
The following is the file structure you’re going to need to train an acoustic model from a set of transcribed sound files. This procedure relies on sphinxtrain, which I show how to download in another post. Filenames matter! In the file structure below, your_model should always be the same string everywhere it appears.
File Structure:
```
your_model/
  etc/
    your_model.dic                  - Phonetic dictionary
    your_model.phone                - Phoneset file
    your_model.lm                   - Language model
    your_model.filler               - List of fillers
    your_model_train.fileids        - List of files for training
    your_model_train.transcription  - Transcription for training
    your_model_test.fileids         - List of files for testing
    your_model_test.transcription   - Transcription for testing
  wav/
    speaker_1/
      file_1.wav                    - Recording of speech utterance
      file_2.wav                    - Recording of speech utterance
    speaker_2/
      file_1.wav
      file_2.wav
```
So, that’s the overall file structure you need, and here’s what the contents of the files should include:
File contents
.fileids (paths to WAV files, without extensions):

```
speaker_1/file_1
speaker_2/file_2
...
```
.transcription (transcription plus file id, without path or extension):

```
<s> hello world </s> (file_1)
<s> foo bar </s> (file_2)
...
```
.wav (recording files): WAV format, 16 kHz, 16-bit, mono for desktop speech; 8 kHz, 16-bit, mono for telephone speech.

**Audio format mismatch is the most common training problem.**
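Since format mismatches bite so often, here’s a quick conversion with sox (assuming sox is installed; input.wav and output.wav are placeholder names, and you’d swap in 8000 for telephone-style data):

```
# Convert to 16 kHz, 16-bit, mono WAV
sox input.wav -r 16000 -b 16 -c 1 output.wav
```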
.dic (alphanumeric characters only; case-insensitive):

```
HELLO HH AH L OW
WORLD W AO R L D
...
```
.phone (one phone per line, plus SIL for silence):

```
SIL
A
O
I
...
```
.lm (or .lm.bin; commonly in ARPA format):

```
\data\
ngram 1=7
ngram 2=7

\1-grams:
0.1 <UNK> 0.5555
0 <s> 0.4939
...
```
.filler (filler/noise dictionary):

```
<s> SIL
</s> SIL
<sil> SIL
+um+ ++um++
+laugh+ ++laugh++
...
```
Once you’ve got the structure correct and the appropriate contents for the files, the training command is actually very simple:
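Assuming sphinxtrain is installed and on your PATH, the whole thing boils down to two commands run from inside the main directory. Here’s the sketch; the walkthrough below goes through it step by step:

```
cd your_model/
sphinxtrain -t your_model setup   # generates etc/sphinx_train.cfg and etc/feat.params
sphinxtrain run                   # runs the full training pipeline
```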
Below is an example of what my file structure looks like before training an acoustic model.
In my case, the main directory is located at ~/Desktop/kyrgyz. So everywhere above where you see your_model in the file and directory names, I just have kyrgyz.
Before training, we should just see two directories in the main dir:
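Something like this:

```
$ ls ~/Desktop/kyrgyz
etc  wav
```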
Just looking into etc/ we see the following:
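That is, the eight files from the structure above, with kyrgyz in place of your_model:

```
$ ls ~/Desktop/kyrgyz/etc
kyrgyz.dic     kyrgyz.lm     kyrgyz_test.fileids        kyrgyz_train.fileids
kyrgyz.filler  kyrgyz.phone  kyrgyz_test.transcription  kyrgyz_train.transcription
```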
If you look into the wav/ dir, you will have any number of subdirs full of WAV files. Here I only have one subdir, audio-files:
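So the listing is just:

```
$ ls ~/Desktop/kyrgyz/wav
audio-files
```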
Since this setup is correct, we can run the first of the two simple commands for training the acoustic model:
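Assuming sphinxtrain is on your PATH, this first command is the setup step, run from the main dir:

```
cd ~/Desktop/kyrgyz
sphinxtrain -t kyrgyz setup
```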
If we look into the main dir again, we will see a lot got generated:
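I won’t reproduce the full listing, but a sphinxtrain working directory typically accumulates directories like these over the course of setup and training (exact names and timing may vary with your sphinxtrain version):

```
bwaccumdir/          # Baum-Welch accumulators
feat/                # extracted acoustic features (MFCCs)
logdir/              # per-stage training logs
model_architecture/  # model topology definitions
model_parameters/    # the trained acoustic models
result/              # decoding output from the built-in test stage
```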
In my original kyrgyz/etc/ dir, some new files appear as well (feat.params and sphinx_train.cfg).
If we look into sphinx_train.cfg, we see the following:
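Here’s an illustrative excerpt; the variable names are real sphinx_train.cfg settings, but the values shown (username, paths) are just examples:

```
# etc/sphinx_train.cfg (excerpt) -- values below are illustrative
$CFG_DB_NAME = "kyrgyz";
$CFG_BASE_DIR = "/home/josh/Desktop/kyrgyz";
$CFG_DICTIONARY = "$CFG_BASE_DIR/etc/$CFG_DB_NAME.dic";
$CFG_RAWPHONEFILE = "$CFG_BASE_DIR/etc/$CFG_DB_NAME.phone";
$CFG_FILLERDICT = "$CFG_BASE_DIR/etc/$CFG_DB_NAME.filler";
$DEC_CFG_LANGUAGEMODEL = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.lm.DMP";
```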
As you can see, it’s very important that you be consistent in your file and directory names, because the paths in the configuration are generated from the name you provide for your_model. In my case, the language model location is built from the base/main directory like this: `$DEC_CFG_LANGUAGEMODEL = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.lm.DMP"`. So, if you are inconsistent, your configuration won’t be able to find your language model or other files.
You should take a look over sphinx_train.cfg to make sure everything is in its right place.
Once you’re sure you’re ready, you can do the actual training with:
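Assuming the setup step above went cleanly:

```
sphinxtrain run
```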
Testing pocketsphinx
Test your system with a microphone:
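A sketch with pocketsphinx_continuous; the -hmm path here is an assumption, since the model_parameters subdirectory name depends on your config (e.g. the senone count), so adjust it to your trained model:

```
pocketsphinx_continuous \
    -inmic yes \
    -hmm model_parameters/kyrgyz.cd_cont_200 \
    -lm etc/kyrgyz.lm \
    -dict etc/kyrgyz.dic
```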
Test your system (i.e. make predictions) on a bunch of WAV files:
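A sketch with pocketsphinx_batch, decoding the files listed in the test .fileids file and writing hypotheses to predictions.hyp (again, adjust the -hmm path to your trained model):

```
pocketsphinx_batch \
    -adcin yes \
    -cepdir wav \
    -cepext .wav \
    -ctl etc/kyrgyz_test.fileids \
    -lm etc/kyrgyz.lm \
    -dict etc/kyrgyz.dic \
    -hmm model_parameters/kyrgyz.cd_cont_200 \
    -hyp predictions.hyp
```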
Once you have run batch decoding on some files and generated a predictions.hyp file, run the following command to see how good your hypotheses were:
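The word_align.pl script that ships with sphinxtrain compares the reference transcriptions against your hypotheses and reports word error rate; the script path below is an assumption, so point it at wherever sphinxtrain lives on your machine:

```
perl sphinxtrain/scripts/decode/word_align.pl \
    etc/kyrgyz_test.transcription predictions.hyp
```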