👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since CMU-Sphinx. Before devoting significant time to deploying CMU-Sphinx, take a look at 🐸 Coqui Speech-to-Text. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s open source on GitHub. I’m on the Coqui founding team, so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with speech technologies like CMU-Sphinx, so I understand the headaches.

With 🐸 STT, we’ve removed the headaches and streamlined for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the 🐸 Model Zoo for open, pre-trained models in different languages. Try it out for yourself, and come join our friendly chatroom 💚






Overview

Below are some commands that I’ve found particularly useful in working with CMU-Sphinx from the command line (i.e. Bash) on my Linux machine.

I hope they’re helpful to others, and if you have comments or suggestions for other commands to include, leave a comment! I’d like to get as much on here as possible, but still have it be more of a cheatsheet than a manual.

Acoustic Model Training

The following is the file structure you’re going to need to train an acoustic model from a set of transcribed sound files. This procedure relies on sphinxtrain, which I show how to download in another post. Filenames matter! In the file structure below, your_model should always be the same name, everywhere it appears.

File Structure:

    your_model/
        etc/
            your_model.dic                 - Phonetic dictionary
            your_model.phone               - Phoneset file
            your_model.lm                  - Language model
            your_model.filler              - List of fillers
            your_model_train.fileids       - List of files for training
            your_model_train.transcription - Transcription for training
            your_model_test.fileids        - List of files for testing
            your_model_test.transcription  - Transcription for testing
        wav/
            speaker_1/
                file_1.wav                 - Recording of speech utterance
                file_2.wav                 - Recording of speech utterance
            speaker_2/
                file_1.wav
                file_2.wav
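
If you’re starting from scratch, here’s a minimal sketch for creating that skeleton (your_model and the speaker dirs are placeholders; use whatever names you like, as long as they stay consistent):

    # Create the directory skeleton for a new model named "your_model"
    mkdir -p your_model/etc your_model/wav/speaker_1 your_model/wav/speaker_2

    # Create empty placeholders for the files sphinxtrain expects
    cd your_model/etc
    touch your_model.dic your_model.phone your_model.lm your_model.filler \
          your_model_train.fileids your_model_train.transcription \
          your_model_test.fileids your_model_test.transcription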

So, that’s the overall file structure you need, and here’s what the contents of the files should include:

File contents

.fileids (paths to WAV files w/o extensions):

     speaker_1/file_1
     speaker_2/file_2
          ...
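
If you don’t want to type these out by hand, here’s one way to generate a .fileids file (a sketch, assuming you’re in the main model dir and every WAV under wav/ belongs in the training set):

    # List every WAV under wav/, stripping the leading "./" and the
    # ".wav" extension so each line is a bare path like speaker_1/file_1
    (cd wav && find . -name '*.wav' | sed 's|^\./||; s|\.wav$||' | sort) \
        > etc/your_model_train.fileids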

.transcription (transcription + file id w/o path or extension):

           <s> hello world </s> (file_1)
           <s> foo bar </s> (file_2)
                      ...
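
The .fileids and .transcription files have to line up exactly, so it’s worth a quick sanity check (using the hypothetical paths from above):

    # Both files must have the same number of lines, and line N of each
    # must refer to the same utterance
    wc -l etc/your_model_train.fileids etc/your_model_train.transcription
    paste etc/your_model_train.fileids etc/your_model_train.transcription | head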

.wav (recording files):

    WAV: 16 kHz, 16 bit, mono (desktop); 8 kHz, 16 bit, mono (telephone)

**Audio format mismatch is the most common training problem**
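
If your recordings are in some other format or sample rate, a tool like sox can fix them up before training (a sketch; assumes sox is installed and that you’re training the 16 kHz desktop variety):

    # Resample to 16 kHz, 16-bit, mono WAV
    sox input.wav -r 16000 -b 16 -c 1 output.wav

    # Double-check a file's actual sample rate, bit depth, and channel count
    soxi output.wav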

.dic (alphanumeric characters only, case-insensitive):

  HELLO HH AH L OW
  WORLD W AO R L D
  ...
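
A word appearing in your transcriptions but missing from your dictionary is a classic way to break training. Here’s a rough check for that (a sketch; it uppercases everything since the dictionary is case-insensitive, and assumes one pronunciation per line):

    # Collect the dictionary's headwords and the transcription's words
    cut -d' ' -f1 etc/your_model.dic | tr 'a-z' 'A-Z' | sort -u > /tmp/dict_words
    tr ' ' '\n' < etc/your_model_train.transcription | tr 'a-z' 'A-Z' \
        | grep -v -e '^<S>$' -e '^</S>$' -e '^(' -e '^$' \
        | sort -u > /tmp/trans_words

    # Print words present in the transcription but absent from the dictionary
    comm -23 /tmp/trans_words /tmp/dict_words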

.phone (one phone per line + SIL for silence):

   SIL
   A
   O
   I
   ...

.lm (or .lm.bin, commonly in ARPA format):

\data\
ngram 1=7
ngram 2=7

\1-grams:
0.1 <UNK>	0.5555
0 <s>	 0.4939
   ...
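
If you don’t have a language model yet, the usual route is the CMUCLMTK toolchain (a sketch of the standard pipeline; assumes cmuclmtk is installed and corpus.txt is a hypothetical text file with one <s> ... </s> sentence per line):

    # Build a vocabulary, an id n-gram file, and finally an ARPA-format LM
    text2wfreq < corpus.txt | wfreq2vocab > your_model.vocab
    text2idngram -vocab your_model.vocab -idngram your_model.idngram < corpus.txt
    idngram2lm -vocab_type 0 -idngram your_model.idngram \
        -vocab your_model.vocab -arpa your_model.lm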

.filler (filler noise dict):

    <s> SIL
    </s> SIL
    <sil> SIL
    +um+ ++um++
    +laugh+ ++laugh++	
    ...

Once you’ve got the structure correct and the appropriate contents for the files, the command to set up training is actually very simple:

josh@yoga:~$ sphinxtrain -t your_model setup

Below is an example of what my file structure looks like before training an acoustic model.

In my case, the main directory is located at ~/Desktop/kyrgyz. So everywhere above where you see your_model in the file and directory names, I just have kyrgyz.

Before training, we should just see two directories in the main dir:

josh@yoga:~/Desktop/kyrgyz$ la
etc  wav

Just looking into etc/ we see the following:

josh@yoga:~/Desktop/kyrgyz$ la etc/
kyrgyz.dic     kyrgyz.phone               kyrgyz_train.fileids
kyrgyz.filler  kyrgyz_test.fileids        kyrgyz_train.transcription
kyrgyz.lm      kyrgyz_test.transcription

The wav/ dir can contain any number of subdirs full of WAV files. Here I only have one subdir, audio-files:

josh@yoga:~/Desktop/kyrgyz$ la wav
audio-files
josh@yoga:~/Desktop/kyrgyz$ la wav/audio-files/
100.wav  276.wav  450.wav       atai_142.wav  atai_319.wav
101.wav  277.wav  451.wav       atai_143.wav  atai_31.wav
102.wav  278.wav  452.wav       atai_144.wav  atai_320.wav
103.wav  279.wav  453.wav       atai_145.wav  atai_321.wav
104.wav  27.wav   454.wav       atai_146.wav  atai_322.wav
105.wav  280.wav  455.wav       atai_147.wav  atai_323.wav
                            .
                            .
                            .

Since this structure is correct, we can run the simple setup command for the acoustic model:

josh@yoga:~/Desktop/kyrgyz$ sphinxtrain -t kyrgyz setup
Sphinxtrain path: /usr/local/lib/sphinxtrain
Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
Setting up the database kyrgyz

If we look into the main dir again, we see that a number of new files and directories have been generated:

josh@yoga:~/Desktop/kyrgyz$ la
bwaccumdir  feat         logdir              model_parameters  result  wav
etc         kyrgyz.html  model_architecture  qmanager          trees

In my original kyrgyz/etc/ dir, some new files appear as well (feat.params and sphinx_train.cfg).

josh@yoga:~/Desktop/kyrgyz$ la etc/
feat.params    kyrgyz.phone               kyrgyz_train.transcription
kyrgyz.dic     kyrgyz_test.fileids        sphinx_train.cfg
kyrgyz.filler  kyrgyz_test.transcription
kyrgyz.lm      kyrgyz_train.fileids

If we look into sphinx_train.cfg, we see the following:

josh@yoga:~/Desktop/kyrgyz/etc$ cat sphinx_train.cfg 
# Configuration script for sphinx trainer                  -*-mode:Perl-*-

$CFG_VERBOSE = 1;		# Determines how much goes to the screen.

# These are filled in at configuration time
$CFG_DB_NAME = "kyrgyz";
# Experiment name, will be used to name model files and log files
$CFG_EXPTNAME = "$CFG_DB_NAME";

# Directory containing SphinxTrain binaries
$CFG_BASE_DIR = "/home/josh/Desktop/kyrgyz";
                    .
                    .
                    .
$DEC_CFG_LANGUAGEMODEL  = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.lm.DMP";
# Or can be JSGF or FSG too, used if uncommented
# $DEC_CFG_GRAMMAR  = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.jsgf";
# $DEC_CFG_FSG  = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.fsg";

$DEC_CFG_LANGUAGEWEIGHT = "10";
$DEC_CFG_BEAMWIDTH = "1e-80";
$DEC_CFG_WORDBEAM = "1e-40";

$DEC_CFG_ALIGN = "builtin";

$DEC_CFG_NPART = 1;		#  Define how many pieces to split decode in

# This variable has to be defined, otherwise utils.pl will not load.
$CFG_DONE = 1;

return 1;

As you can see, it’s very important to be consistent in your file and directory names, because the configuration values are generated at setup time from the name you provide for your_model. In my case, the language model location is built by referencing the base/main directory: $DEC_CFG_LANGUAGEMODEL = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.lm.DMP". So, if you are inconsistent, your configuration won’t be able to find your language model or other files.
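
Notice also that the config points to kyrgyz.lm.DMP, the binary LM format, while we supplied a plain-text kyrgyz.lm. If training complains that it can’t find the LM, one option is to convert it with sphinx_lm_convert from sphinxbase (a sketch; depending on your version you may need to pass an explicit output format):

    # Convert an ARPA-format LM to the binary DMP format the config expects
    sphinx_lm_convert -i kyrgyz.lm -o kyrgyz.lm.DMP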

You should take a look over sphinx_train.cfg to make sure everything is in its right place.

Once you’re sure you’re ready, you can do the actual training with:

sphinxtrain run
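
Training can take a long time; progress lands in the logdir/ directory and is summarized in the generated HTML file (kyrgyz.html in my case). If one stage fails, sphinxtrain can restart from a given stage (a sketch; check sphinxtrain --help for the stage names your version supports):

    # Re-run the pipeline from a single stage, e.g. just the decode step
    sphinxtrain -s decode run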



Testing pocketsphinx

Test your system with a microphone:

pocketsphinx_continuous -hmm your_hmm_dir -lm your_language_model.lm -dict your_phonetic_dictionary.dic -inmic yes

Test your system (i.e. make predictions) on a bunch of WAV files:

pocketsphinx_batch  -adcin yes  -cepdir dir_of_test_WAVs  -cepext .wav  -ctl your_test.fileids -hmm your_hmm_dir -lm your_language_model.lm -dict your_phonetic_dictionary.dic  -hyp predictions.hyp

Once you have run batch decoding on some files and generated a predictions.hyp file, run this command to see how good your hypotheses were (word_align.pl ships with sphinxtrain):

perl word_align.pl your_test.transcription predictions.hyp