Some Kaldi Notes
👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since Kaldi. Before devoting weeks of your time to deploying Kaldi, take a look at 🐸 Coqui Speech-to-Text. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s open source on GitHub. I’m on the Coqui founding team so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with Kaldi, so I understand the headaches.
With 🐸 STT, we’ve removed the headaches of Kaldi and streamlined everything for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the 🐸 Model Zoo for open, pre-trained models in different languages. Try it out for yourself, and come join our friendly chatroom 💚
Documentation
Here’s some documentation I’ve been working on as a walk-through of a typical Kaldi run.sh script.
Bayes Rule and ASR
This comes from the first set of slides in Dan Povey’s lectures on using Kaldi.
\[P(\text{ utterance } \vert \text{ audio })=\frac{p(\text{ audio } \vert \text{ utterance }) \cdot P(\text{ utterance })}{p(\text{ audio })}\]
- \(P(\text{ utterance })\) comes from our language model (i.e. n-gram)
- \(p(\text{ audio } \vert \text{ utterance })\) is a sentence-dependent statistical model of audio production, trained from data
- Given a test utterance, we pick ‘utterance’ to maximize \(P(\text{ utterance } \vert \text{ audio })\).
- Note: \(p(\text{ audio })\) is a normalizer that doesn’t matter
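Since \(p(\text{ audio })\) doesn’t depend on the utterance, the decoder’s decision rule reduces to:

\[\text{best utterance}=\underset{\text{utterance}}{\operatorname{argmax}}\ p(\text{ audio } \vert \text{ utterance }) \cdot P(\text{ utterance })\]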
WFST Key Concepts
- determinization
- minimization
- composition
- equivalence
- epsilon-free
- functional
- on-demand algorithm
- weight-pushing
- epsilon removal
HMM Key Concepts
- Markov Chain
- Hidden Markov Model
- Forward-backward algorithm
- Viterbi algorithm
- E-M for mixture of Gaussians
L.fst: The Phonetic Dictionary FST
L maps monophone sequences to words.
The file L.fst is the Finite State Transducer form of the lexicon with phone symbols on the input and word symbols on the output.
See “Speech Recognition with Weighted Finite-State Transducers” by Mohri, Pereira and Riley, in Springer Handbook on Speech Processing and Speech Communication, 2008 for more information.
Here’s an example lexicon FST with two words. The following description of L’s structure comes from the Kaldi documentation:
The structure of the lexicon is roughly as one might expect. There is one state (the “loop state”) which is final. There is a start state that has two transitions to the loop state: one with silence and one without. From the loop state there is a transition corresponding to each word, and that word is the output symbol on the transition; the input symbol is the first phone of that word. It is important both for the efficiency of composition and the effectiveness of minimization that the output symbol should be as early as possible (i.e. at the beginning not the end of the word). At the end of each word, to handle optional silence, the transition corresponding to the last phone is in two forms: one to the loop state and one to the “silence state” which has a transition to the loop state. We don’t bother putting optional silence after silence words, which we define as words that have just one phone that is the silence phone.
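If you want to inspect an L.fst yourself, OpenFst’s command-line tools can print or draw it. A quick sketch, assuming a standard data/lang directory produced by prepare_lang.sh:

```bash
# Print the lexicon FST with human-readable input (phone) and
# output (word) symbols:
fstprint --isymbols=data/lang/phones.txt \
         --osymbols=data/lang/words.txt \
         data/lang/L.fst | head

# Render it with dot (only practical for toy lexicons):
fstdraw --isymbols=data/lang/phones.txt \
        --osymbols=data/lang/words.txt \
        --portrait data/lang/L.fst | dot -Tpdf > L.pdf
```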
L_disambig.fst: The Phonetic Dictionary with Disambiguation Symbols FST
A lexicon with disambiguation symbols; see Mohri et al.’s work for more info.
In general, you need to have disambiguation symbols when you have one word that is a prefix of another (cat and cats in the same lexicon would need to have cat being pronounced “k ae t #1”) or a homophone of another word (red: “r eh d #1”, read: “r eh d #2”). If you don’t have these then the models become nondeterministic.
Disambiguation symbols like #1 and #2 go on the ends of words to ensure determinizability.
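For example, after Kaldi’s utils/add_lex_disambig.pl has processed the lexicon, the entries above would look something like this (pronunciations are illustrative):

```
cat   k ae t #1
cats  k ae t s
red   r eh d #1
read  r eh d #2
```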
G.fst: The Language Model FST
An FSA grammar (an acceptor, with identical input and output labels), typically built from an n-gram language model.
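As a sketch of how G.fst gets built: Kaldi’s arpa2fst compiles an ARPA-format n-gram model into an FST, mapping the backoff (epsilon) arcs to the #0 disambiguation symbol. Paths and the LM file name are illustrative (recipes usually wrap this in utils/format_lm.sh):

```bash
# Compile an ARPA n-gram LM into G.fst:
arpa2fst --disambig-symbol=#0 \
         --read-symbol-table=data/lang/words.txt \
         lm/task.arpabo data/lang/G.fst
```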
C.fst: The Context FST
C maps triphone sequences to monophones.
Expands the phones into context-dependent phones.
H.fst: The HMM FST
H maps multiple HMM states (a.k.a. transition-ids in Kaldi-speak) to context-dependent triphones.
Expands out the HMMs. On the right (output) are the context-dependent phones and on the left (input) are the transition-ids.
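To see how transition-ids relate to phones, HMM states, and pdf-ids in your own model, show-transitions is handy (the model path is illustrative):

```bash
# For each phone and HMM state, list the pdf-id and the
# transition-ids that point at it:
show-transitions data/lang/phones.txt exp/mono/final.mdl | head
```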
HCLG.fst: The Final Graph
mkgraph.sh: Graph Compilation
This script creates a fully expanded decoding graph (HCLG) that represents the language-model, pronunciation dictionary (lexicon), context-dependency, and HMM structure in our model. The output is a Finite State Transducer that has word-ids on the output, and pdf-ids on the input (these are indexes that resolve to Gaussian Mixture Models).
The required input files are described in the graph creation recipe documentation: http://kaldi.sourceforge.net/graph_recipe_test.html
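A typical invocation, assuming a test-time lang directory and a trained model directory (names are illustrative; very old versions of the script needed a --mono flag for monophone models):

```bash
# Compose H, C, L, and G into exp/mono/graph/HCLG.fst:
utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph
```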
OpenFst Symbol Tables
Both words.txt and phones.txt are OpenFst symbol tables.
From openfst.org:
Symbol (string) to int and reverse mapping.
The SymbolTable implements the mappings of labels to strings and reverse. SymbolTables are used to describe the alphabet of the input and output labels for arcs in a Finite State Transducer.
SymbolTables are reference counted and can therefore be shared across multiple machines. For example a language model grammar G, with a SymbolTable for the words in the language model can share this symbol table with the lexical representation L o G.
For every FST, the symbol ID 0 is reserved for \(\epsilon\).
words.txt
An OpenFst symbol table.
The file words.txt is created by prepare_lang.sh and is a list of all words in the vocabulary, in addition to silence markers, and the disambiguation symbol “#0” (used for epsilon on the input of G.fst).
Each word has a unique number.
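The beginning and end of a words.txt might look like this; the actual words and integer IDs depend entirely on your vocabulary (this fragment is illustrative):

```
<eps> 0
<unk> 1
eight 2
five 3
four 4
...
#0 13
<s> 14
</s> 15
```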
phones.txt
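Also an OpenFst symbol table created by prepare_lang.sh: it maps every phone, plus the disambiguation symbols, to a unique integer. A fragment might look like this (the phone inventory is illustrative):

```
<eps> 0
sil 1
aa 2
ae 3
...
#0 26
#1 27
```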
oov.txt
This file has a single line with the word (not the phone!) used for out-of-vocabulary items.
In my case I’m using “<unk>” because that’s what I get from IRSTLM in my language model (task.arpabo), and this entry has to be identical to that.
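So in this setup the entire file is one line:

```bash
$ cat data/lang/oov.txt
<unk>
```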
Tree Info
In the simple acoustic model below, I have 25 phones. I use a 3-state HMM topology for non-silence phones, and a 5-state topology for silence phones. Since I have 24 non-silence phones, and only one silence phone, I get:

(24 × 3) + (1 × 5) = 77

77 total states in the acoustic model.
Looking at the decision tree for a monophone model, we see there is one pdf per state.
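You can sanity-check this count with tree-info (the model directory here is illustrative); for a monophone model it should report something like:

```bash
$ tree-info exp/mono/tree
num-pdfs 77
context-width 1
central-position 0
```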
You can easily visualize a tree with Kaldi’s draw-tree binary and dot:
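Something like the following should work (paths assume a standard egs-style layout):

```bash
# Render the monophone decision tree as a PDF:
draw-tree data/lang/phones.txt exp/mono/tree | \
    dot -Tpdf -Gsize=8,10.5 > mono-tree.pdf
```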
We end up with a nice graph in the file mono-tree.pdf.
Working with Alignments (Phone or Word)
Choose one training utt
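For example, to pull the alignment for a single utterance out of the training archives in text form (paths are illustrative):

```bash
# Print one utterance's alignment (a sequence of transition-ids):
copy-int-vector "ark:gunzip -c exp/mono/ali.1.gz|" ark,t:- | head -n 1
```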
Word Level alignment
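One common route is steps/get_train_ctm.sh, which turns the training alignments into a word-level CTM (directory names are illustrative):

```bash
# Writes word-level CTM entries to exp/mono/ctm:
steps/get_train_ctm.sh data/train data/lang exp/mono
```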
Phone level alignments (CTM style)
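ali-to-phones with the --ctm-output flag writes one CTM line per phone; the model and archive paths below are illustrative:

```bash
# Convert training alignments into phone-level CTM entries:
ali-to-phones --ctm-output exp/mono/final.mdl \
    "ark:gunzip -c exp/mono/ali.1.gz|" mono-phone.ctm
```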
Pretty phone alignments
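The CTM above holds integer phone IDs; utils/int2sym.pl maps them back to phone names (in a CTM line the phone is field 5):

```bash
# Replace integer phone IDs with human-readable phone symbols:
utils/int2sym.pl -f 5 data/lang/phones.txt mono-phone.ctm | head
```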
Print the Training graph for one utterance
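A sketch using compile-train-graphs, where one-utt.txt is a hypothetical file holding a single line of the form “<utt-id> <word1> <word2> …”:

```bash
# Compile the training graph (an FST) for one utterance
# and print it in text form:
compile-train-graphs \
    --read-disambig-syms=data/lang/phones/disambig.int \
    exp/mono/tree exp/mono/final.mdl data/lang/L_disambig.fst \
    "ark:utils/sym2int.pl -f 2- data/lang/words.txt one-utt.txt|" \
    ark:- | fstcopy ark:- ark,t:-
```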
Other Blogs
Here’s “Kaldi For Dummies”.
Here’s a good blog.
Here’s a good tutorial.
Another tutorial.