Here’s some documentation I’ve been working on as a walk-through of a typical Kaldi run.sh script.
Bayes Rule and ASR
This is from a slide in the first set of slides from Dan Povey’s lectures on using Kaldi.
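$$P(\text{utterance} \mid \text{audio}) = \frac{p(\text{audio} \mid \text{utterance})\, P(\text{utterance})}{p(\text{audio})}$$
- $P(\text{utterance})$ comes from our language model (i.e. n-gram)
- $p(\text{audio} \mid \text{utterance})$ is a sentence-dependent statistical model of audio production, trained from data
- Given a test utterance, we pick ‘utterance’ to maximize $P(\text{utterance} \mid \text{audio})$.
- Note: $p(\text{audio})$ is a normalizer that doesn’t matter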
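The point about the normalizer can be checked numerically. Here is a minimal sketch in Python, assuming two hypothetical candidate utterances with invented log-scores (none of these numbers come from a real model). It picks the best candidate once from the unnormalized product and once from the normalized posterior, and the two choices agree.

```python
import math

# Hypothetical log-scores for one audio clip; all numbers are invented.
candidates = {
    "recognize speech": {"log_p_audio_given_utt": -120.0, "log_p_utt": -4.0},
    "wreck a nice beach": {"log_p_audio_given_utt": -119.0, "log_p_utt": -9.0},
}

def joint_log_score(scores):
    # log p(audio | utterance) + log P(utterance)
    return scores["log_p_audio_given_utt"] + scores["log_p_utt"]

def posterior(cands):
    # Normalize by p(audio), approximated here by summing over the candidates.
    log_joint = {utt: joint_log_score(s) for utt, s in cands.items()}
    top = max(log_joint.values())
    # log-sum-exp, shifted by the maximum for numerical stability
    log_p_audio = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {utt: math.exp(v - log_p_audio) for utt, v in log_joint.items()}

best_unnormalized = max(candidates, key=lambda utt: joint_log_score(candidates[utt]))
post = posterior(candidates)
best_normalized = max(post, key=post.get)
print(best_unnormalized, best_normalized)
```

Dividing every candidate by the same p(audio) rescales the scores without reordering them, which is why a decoder can skip that term.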
WFST Key Concepts
- on-demand algorithm
- epsilon removal
HMM Key Concepts
- Markov Chain
- Hidden Markov Model
- Forward-backward algorithm
- Viterbi algorithm
- E-M for mixture of Gaussians
L.fst: The Phonetic Dictionary FST
L maps monophone sequences to words.
The file L.fst is the Finite State Transducer form of the lexicon with phone symbols on the input and word symbols on the output.
See “Speech Recognition with Weighted Finite-State Transducers” by Mohri, Pereira and Riley, in Springer Handbook on Speech Processing and Speech Communication, 2008, for more information.
Here’s an example with two words:
The following section comes from the documentation.
The structure of the lexicon is roughly as one might expect. There is one state (the “loop state”) which is final. There is a start state that has two transitions to the loop state: one with silence and one without. From the loop state there is a transition corresponding to each word, and that word is the output symbol on the transition; the input symbol is the first phone of that word. It is important both for the efficiency of composition and the effectiveness of minimization that the output symbol should be as early as possible (i.e. at the beginning not the end of the word). At the end of each word, to handle optional silence, the transition corresponding to the last phone is in two forms: one to the loop state and one to the “silence state” which has a transition to the loop state. We don’t bother putting optional silence after silence words, which we define as words that have just one phone that is the silence phone.
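The structure described above can be sketched in a few lines of Python that emit arcs in OpenFst's text format (`source destination input output`, plus a line holding just the final state). This is only an illustration under simplifying assumptions: the state numbering, the silence phone name `SIL`, the toy two-word lexicon, and the absence of arc weights are all choices made here, and the function is not Kaldi's own lexicon-building script.

```python
def lexicon_fst_arcs(lexicon, sil="SIL", eps="<eps>"):
    """Return OpenFst text-format lines for a toy lexicon FST (no weights)."""
    start, loop, sil_state = 0, 1, 2
    lines = [
        f"{start} {loop} {eps} {eps}",      # start without silence
        f"{start} {loop} {sil} {eps}",      # start with silence
        f"{sil_state} {loop} {sil} {eps}",  # optional silence after a word
    ]
    next_state = 3
    for word, phones in lexicon:
        src = loop
        is_silence_word = list(phones) == [sil]
        for i, phone in enumerate(phones):
            out = word if i == 0 else eps   # word label goes on the first arc
            if i == len(phones) - 1:
                # Last phone: one arc back to the loop state...
                lines.append(f"{src} {loop} {phone} {out}")
                # ...and one to the silence state, except for silence words.
                if not is_silence_word:
                    lines.append(f"{src} {sil_state} {phone} {out}")
            else:
                lines.append(f"{src} {next_state} {phone} {out}")
                src = next_state
                next_state += 1
    lines.append(f"{loop}")                 # the loop state is final
    return lines

# Hypothetical two-word lexicon.
toy_lexicon = [("cat", ["k", "ae", "t"]), ("at", ["ae", "t"])]
arcs = lexicon_fst_arcs(toy_lexicon)
print("\n".join(arcs))
```

Note how the word label appears on the arc for the first phone of each word, as the quoted documentation recommends.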
L_disambig.fst: The Phonetic Dictionary with Disambiguation Symbols FST
A lexicon with disambiguation symbols; see Mohri et al.’s work for more info.
In general, you need disambiguation symbols when the pronunciation of one word is a prefix of another’s (with cat and cats in the same lexicon, cat would need to be pronounced “k ae t #1”) or when two words are homophones (red: “r eh d #1”, read: “r eh d #2”). Without these symbols, the lexicon FST cannot be determinized.
Disambiguation symbols are symbols like #1 and #2 that go on the ends of pronunciations to ensure determinizability.
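The rule above can be sketched as follows. This is a simplified illustration and not Kaldi's own script: a pronunciation gets a symbol if it is shared by several words or is a proper prefix of another pronunciation, and the symbols are numbered per pronunciation. The function name and the toy lexicon are made up for this example.

```python
from collections import Counter

def add_disambig(lexicon):
    """Append #1, #2, ... to pronunciations that would otherwise be ambiguous."""
    prons = [tuple(phones) for _, phones in lexicon]
    count = Counter(prons)
    # Every proper prefix of every pronunciation.
    prefixes = {p[:i] for p in prons for i in range(1, len(p))}
    used = Counter()
    result = []
    for word, phones in lexicon:
        key = tuple(phones)
        if count[key] == 1 and key not in prefixes:
            result.append((word, list(phones)))       # unambiguous: unchanged
        else:
            used[key] += 1                            # next symbol for this pronunciation
            result.append((word, list(phones) + [f"#{used[key]}"]))
    return result

toy_lexicon = [
    ("cat", ["k", "ae", "t"]),
    ("cats", ["k", "ae", "t", "s"]),
    ("red", ["r", "eh", "d"]),
    ("read", ["r", "eh", "d"]),
]
disambiguated = add_disambig(toy_lexicon)
for word, phones in disambiguated:
    print(word, " ".join(phones))
```

On this toy lexicon, cat and the two homophones receive symbols, while cats is left alone because nothing else starts with its full pronunciation.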
G.fst: The Language Model FST
FSA grammar (can be built from an n-gram grammar).
C.fst: The Context FST
C maps triphone sequences to monophones.
Expands the phones into context-dependent phones.
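To make the idea of context expansion concrete, here is a minimal Python sketch of the mapping between a monophone sequence and its triphones. It only illustrates the relation that C encodes. The tuple notation and the `<eps>` boundary padding are assumptions made for this example, and the sketch does not reproduce how Kaldi actually builds C.fst.

```python
def to_triphones(phones, boundary="<eps>"):
    """Expand a monophone sequence into (left, center, right) triples."""
    padded = [boundary] + list(phones) + [boundary]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

def to_monophones(triphones):
    """Map a triphone sequence back to its monophones (the center phones)."""
    return [center for _, center, _ in triphones]

triphones = to_triphones(["k", "ae", "t"])
print(triphones)
print(to_monophones(triphones))
```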
H.fst: The HMM FST
H maps multiple HMM states (a.k.a. transition-ids in Kaldi-speak) to context-dependent triphones.
Expands out the HMMs. On the right are the context-dependent phones and on the left are the pdf-ids.
HCLG.fst: final graph
mkgraph.sh: Graph compilation
This script creates a fully expanded decoding graph (HCLG) that represents the language-model, pronunciation dictionary (lexicon), context-dependency, and HMM structure in our model. The output is a Finite State Transducer that has word-ids on the output, and pdf-ids on the input (these are indexes that resolve to Gaussian Mixture Models).
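As a rough summary of what the script computes, Kaldi's graph-construction documentation describes the final graph approximately as shown below. Here asl, rds, and H' are shorthand borrowed from that description (add self-loops, remove disambiguation symbols, and H without self-loops), not names used elsewhere in this post.

$$\mathrm{HCLG} = \mathrm{asl}(\min(\mathrm{rds}(\det(H' \circ \min(\det(C \circ \min(\det(L \circ G))))))))$$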
The following files are required:
OpenFst Symbol Tables
Both words.txt and phones.txt are OpenFst symbol tables: they map symbols (strings) to integers and back.
The SymbolTable implements the mappings of labels to strings and reverse. SymbolTables are used to describe the alphabet of the input and output labels for arcs in a Finite State Transducer.
SymbolTables are reference counted and can therefore be shared across multiple machines. For example a language model grammar G, with a SymbolTable for the words in the language model can share this symbol table with the lexical representation L o G.
For every FST the symbol ‘0’ is reserved for epsilon (written <eps> in Kaldi’s symbol tables).
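In text form, a symbol table is just one `symbol integer` pair per line. The following is a minimal Python sketch of reading such a file into forward and reverse mappings. The table contents are a made-up example, and the function is a stand-in for illustration, not the OpenFst API.

```python
def read_symbol_table(text):
    """Parse 'symbol integer' lines into symbol->int and int->symbol dicts."""
    sym2int, int2sym = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        symbol, index = line.split()
        sym2int[symbol] = int(index)
        int2sym[int(index)] = symbol
    return sym2int, int2sym

# Hypothetical words.txt contents.
example_table = """<eps> 0
hello 1
world 2
#0 3
"""

sym2int, int2sym = read_symbol_table(example_table)
print(sym2int["hello"], int2sym[0])
```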
The file words.txt is an OpenFst symbol table created by prepare_lang.sh. It lists all words in the vocabulary, in addition to silence markers and the disambiguation symbol “#0” (used for epsilon on the input of G.fst).
Each word has a unique number.
The file oov.txt has a single line with the word (not the phone!) for out-of-vocabulary items.
In my case I’m using “<unk>” because that’s what I get from IRSTLM in my language model (task.arpabo), and this entry has to be identical to that.
In the simple acoustic model below, I have 25 phones. I use a 3-state HMM topology for non-silence phones and a 5-state topology for silence phones. Since I have 24 non-silence phones and only one silence phone, I get:
(24 x 3) + (1 x 5) = 77
77 total states in the acoustic model.
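The same count can be written as a small Python helper, which makes it easy to redo the arithmetic for a different phone set or topology. The function and parameter names are made up for this sketch.

```python
def total_hmm_states(n_nonsilence, n_silence,
                     states_per_nonsilence=3, states_per_silence=5):
    """Count HMM states for a monophone model with fixed topologies."""
    return (n_nonsilence * states_per_nonsilence
            + n_silence * states_per_silence)

total = total_hmm_states(n_nonsilence=24, n_silence=1)
print(total)
```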
Looking at the decision tree for a monophone model, we see there is one
You can easily visualize a tree with Kaldi’s
We end up with a nice graph in the file
Working with Alignments (Phone or Word)
training fsts graphs
Choose one training utt
Word Level alignment
Phone level alignments (CTM style)
Pretty phone alignments
Print the Training graph for one utterance
Here’s “Kaldi For Dummies”.
Here’s a good blog.
Here’s a good tutorial.