👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since Kaldi. Before devoting weeks of your time to deploying Kaldi, take a look at 🐸 [Coqui Speech-to-Text][coqui-github]. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s [open source on GitHub][coqui-github]. I’m on the Coqui founding team, so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with Kaldi, so I understand the headaches.

With 🐸 STT, we’ve removed the headaches of Kaldi and streamlined everything for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the [🐸 Model Zoo][coqui-model-zoo] for open, pre-trained models in different languages. Try it out for yourself, and come join our [friendly chatroom][coqui-gitter] 💚






Introduction

In an attempt to be more systematic about tuning my hyperparameters for an nnet3 model, I’ve decided to keep this post as a kind of collection of running notes.



DNN Architecture

Looking through the official example scripts (found in KALDI/egs/), here’s what I find for typical architectures:

Switchboard Corpus

# Taken from script: KALDI/egs/swbd/s5c/local/nnet3/run_tdnn.sh

input dim=100 name=ivector
input dim=40 name=input

# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

# the first splicing is moved before the lda layer, so no splicing here
relu-renorm-layer name=tdnn1 dim=1024
relu-renorm-layer name=tdnn2 input=Append(-1,2) dim=1024
relu-renorm-layer name=tdnn3 input=Append(-3,3) dim=1024
relu-renorm-layer name=tdnn4 input=Append(-3,3) dim=1024
relu-renorm-layer name=tdnn5 input=Append(-7,2) dim=1024
relu-renorm-layer name=tdnn6 dim=1024

output-layer name=output input=tdnn6 dim=$num_targets max-change=1.5 presoftmax-scale-file=$dir/configs/presoftmax_prior_scale.vec
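
One number worth reading off a config like this is the total temporal context per output frame: splice offsets compose additively across layers, so you can just sum the most negative and the most positive offsets. As a quick sanity check (nothing recipe-specific here, only arithmetic over the offsets above):

# Left context  = 2+1+3+3+7 = 16 frames (lda + tdnn2..tdnn5 negative offsets)
# Right context = 2+2+3+3+2 = 12 frames (positive offsets)
echo "left=$((2+1+3+3+7)) right=$((2+2+3+3+2))"   # prints: left=16 right=12

So every output frame of this Switchboard TDNN depends on a 29-frame window of input (16 left, the current frame, and 12 right).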

Wall Street Journal

# Taken from script KALDI/egs/wsj/s5/local/nnet3/run_tdnn.sh

input dim=100 name=ivector
input dim=40 name=input

# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

# the first splicing is moved before the lda layer, so no splicing here
relu-renorm-layer name=tdnn1 dim=650
relu-renorm-layer name=tdnn2 dim=650 input=Append(-1,0,1)
relu-renorm-layer name=tdnn3 dim=650 input=Append(-1,0,1)
relu-renorm-layer name=tdnn4 dim=650 input=Append(-3,0,3)
relu-renorm-layer name=tdnn5 dim=650 input=Append(-6,-3,0)
output-layer name=output dim=$num_targets max-change=1.5

Tedlium

Some scripts instead call the steps/nnet3/tdnn/make_configs.py script to generate the config files automatically; those calls look something like this:

# Taken from script: /egs/tedlium/s5/local/nnet3/run_tdnn.sh

# create the config files for nnet initialization
python steps/nnet3/tdnn/make_configs.py  \
    --feat-dir data/${train_set}_hires \
    --ivector-dir exp/nnet3/ivectors_${train_set} \
    --ali-dir $ali_dir \
    --relu-dim 500 \
    --splice-indexes "-1,0,1 -1,0,1,2 -3,0,3 -3,0,3 -3,0,3 -6,-3,0" \
    --use-presoftmax-prior-scale true \
    $dir/configs || exit 1;

Librispeech

# Taken from script: KALDI/egs/librispeech/s5/local/nnet3/run_tdnn.sh

# create the config files for nnet initialization
python steps/nnet3/tdnn/make_configs.py  \
   --feat-dir $train_data_dir \
   --ivector-dir $train_ivector_dir \
   --ali-dir $ali_dir \
   --relu-dim 1280 \
   --splice-indexes "-2,-1,0,1,2 -1,2 -3,3 -7,2 0"  \
   --use-presoftmax-prior-scale true \
   $dir/configs || exit 1;

Fisher Switchboard

# Taken from script: KALDI/egs/fisher_swbd/s5/local/nnet3/run_tdnn.sh

# create the config files for nnet initialization
python steps/nnet3/tdnn/make_configs.py  \
   --feat-dir data/${train_set}_hires \
   --ivector-dir exp/nnet3/ivectors_${train_set} \
   --ali-dir $ali_dir \
   --relu-dim 1024 \
   --splice-indexes "-2,-1,0,1,2 -1,2 -3,3 -3,3 -7,2 0"  \
   --use-presoftmax-prior-scale true \
   $dir/configs || exit 1;



Learning Rates
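
The lists in this and the following sections are just grep output over the recipe scripts. Something along these lines, run from the egs/ directory of a Kaldi checkout, should reproduce them (the paths are illustrative, and a few settings live outside local/):

# Collect the learning-rate settings used across the official recipes;
# the same trick works for num-epochs, num-jobs-initial, and so on.
grep -r "effective-lrate" */s5*/local/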

babel/s5d/local/chain/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.001 \
babel/s5d/local/chain/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.0001 \
babel/s5d/local/nnet3/run_tdnn.sh:    --initial-effective-lrate 0.0017 --final-effective-lrate 0.00017 \
callhome_egyptian/s5/local/nnet3/run_tdnn.sh:    --initial-effective-lrate 0.005 --final-effective-lrate 0.0005 \
aspire/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.0017 \
aspire/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.00017 \
librispeech/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.0017 \
librispeech/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.00017 \
iban/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.005 \
iban/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.0005 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.0017 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.00017 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.001 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.0001 \
ami/s5b/local/nnet3/run_tdnn.sh:    --initial-effective-lrate 0.0015 --final-effective-lrate 0.00015 \
ami/s5/local/nnet3/run_tdnn.sh:    --initial-effective-lrate 0.0015 --final-effective-lrate 0.00015 \
tedlium/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.0015 \
tedlium/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.00015 \
sprakbanken/s5/local/nnet3/run_tdnn.sh:    --initial-effective-lrate 0.0015 --final-effective-lrate 0.00015 \
multi_en/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.0017 \
multi_en/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.00017 \
csj/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.initial-effective-lrate 0.0017 \
csj/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.final-effective-lrate 0.00017 \



Number of Training Epochs

iban/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 4 \
iban/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 2 \
callhome_egyptian/s5/local/nnet3/run_tdnn.sh:    --num-epochs 8 --num-jobs-initial 2 --num-jobs-final 14 \
aspire/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 3 \
tedlium/s5/local/chain/run_tdnn.sh:num_epochs=4
tedlium/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 2 \
sprakbanken/s5/local/nnet3/run_tdnn.sh:num_epochs=3
librispeech/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 4 \
wsj/s5/steps/nnet3/train_tdnn.sh:num_epochs=15      # Number of epochs of training;
wsj/s5/steps/nnet3/chain/train_tdnn.sh:num_epochs=10      # Number of epochs of training;
ami/s5b/local/chain/multi_condition/run_tdnn.sh:    --trainer.num-epochs 4 \
ami/s5b/local/nnet3/run_tdnn.sh:num_epochs=3
ami/s5/local/nnet3/run_tdnn.sh:num_epochs=3
hkust/s5/local/chain/run_tdnn.sh:num_epochs=4
hkust/s5/local/nnet3/run_tdnn.sh:num_epochs=4
csj/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 2 \
multi_en/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 4 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh:    --trainer.num-epochs 4 \
aishell/s5/local/nnet3/run_tdnn.sh:num_epochs=4
babel/s5d/local/chain/run_tdnn.sh:    --trainer.num-epochs 4 \
babel/s5d/local/nnet3/run_tdnn.sh:    --num-epochs 5 --num-jobs-initial 2 --num-jobs-final 6 \
fisher_english/s5/local/chain/run_tdnn.sh:num_epochs=4



Number of Jobs Initial / Final

TL;DR

  1. small data ==> small num jobs
  2. few GPUs ==> small num jobs

Some insights

Dan Povey’s response to a question about optimizing num-jobs-initial and num-jobs-final:

Q: In training nnet2/nnet3 networks, do the num-jobs-initial/final parameters make any important difference to the result, or is this just an efficiency heuristic to distribute as many jobs as the nnet can deal with (in terms of averaging the parallel jobs results I suppose)? [ David van Leeuwen ]

A: They do make a bit of a difference, but in general fewer is better [except if your system has too many parameters and is overtraining excessively, but best the solution for that is to use fewer epochs or fewer parameters.]

… in general it’s best to use no more jobs than the number of GPUs that you have. And when reducing the num-jobs, it might be a good idea to reduce the number of epochs slightly. [ Dan Povey ]

Another response from Povey on the Kaldi team’s view of these parameters:

Q: if I change the value of num-jobs-initial and num-jobs-final : do I need to change also the other parameters like the learning rate? changing the values of num-jobs-initial and num-jobs-final doesn’t affect the accuracy of the training ? That’s why I wonder why the author of run_tdnn.sh has chosen the value “2” and “12” for these parameters, it’s the same question when I see others parameters in kaldi scripts like “$nj” set to a precise value, if we change $nj does it affect the final accuracy and the WER ? [ Mannix ]

A: You don’t have to change the learning rate, those are automatically adjusted in a suitable way. It may affect results, and it may be a good idea to reduce the num-epochs if you reduce the num-jobs, but we normally don’t tune the num-jobs for WER (we choose it based on convenience in training time) so the WER could be better or worse. In general, for non-neural-net scripts and for decoding, the $nj option makes no difference to the results. [ Dan Povey ]

iban/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 2 \
iban/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-final 4 \
callhome_egyptian/s5/local/nnet3/run_tdnn.sh:    --num-epochs 8 --num-jobs-initial 2 --num-jobs-final 14 \
aspire/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 4 \
aspire/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-final 22 \
tedlium/s5/local/chain/run_tdnn.sh:num_jobs_initial=3
tedlium/s5/local/chain/run_tdnn.sh:num_jobs_final=8
tedlium/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 3 \
tedlium/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-final 8 \
sprakbanken/s5/local/nnet3/run_tdnn.sh:    --num-epochs $num_epochs --num-jobs-initial 2 --num-jobs-final 12 \
librispeech/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 3 \
librispeech/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-final 16 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 2 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh:    --trainer.optimization.num-jobs-final 12 \
ami/s5b/local/nnet3/run_tdnn.sh:    --num-epochs $num_epochs --num-jobs-initial 2 --num-jobs-final 12 \
ami/s5/local/nnet3/run_tdnn.sh:    --num-epochs $num_epochs --num-jobs-initial 2 --num-jobs-final 12 \
hkust/s5/local/chain/run_tdnn.sh:num_jobs_initial=2
hkust/s5/local/chain/run_tdnn.sh:num_jobs_final=12
hkust/s5/local/nnet3/run_tdnn.sh:num_jobs_initial=2
hkust/s5/local/nnet3/run_tdnn.sh:num_jobs_final=12
csj/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 1 \
csj/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-final 4 \
multi_en/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 3 \
multi_en/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-final 16 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 3 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh:    --trainer.optimization.num-jobs-final 16 \
aishell/s5/local/nnet3/run_tdnn.sh:num_jobs_initial=2
aishell/s5/local/nnet3/run_tdnn.sh:num_jobs_final=12
babel/s5d/local/chain/run_tdnn.sh:    --trainer.optimization.num-jobs-initial 2 \
babel/s5d/local/chain/run_tdnn.sh:    --trainer.optimization.num-jobs-final 12 \
babel/s5d/local/nnet3/run_tdnn.sh:    --num-epochs 5 --num-jobs-initial 2 --num-jobs-final 6 \
fisher_english/s5/local/chain/run_tdnn.sh:num_jobs_initial=3
fisher_english/s5/local/chain/run_tdnn.sh:num_jobs_final=16
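
Putting the pieces together, these options all end up on the nnet3 training call. The sketch below is only a template: the hyperparameter flags and the script name (steps/nnet3/train_dnn.py) are the ones that appear in the recipes above, but the data, i-vector, alignment, and output directories are placeholders, so copy the real invocation from the run_tdnn.sh of whatever recipe you are adapting.

# Hedged template: hyperparameter flags as used in the recipes above;
# the directory arguments are placeholders for your own experiment.
steps/nnet3/train_dnn.py \
    --cmd="$decode_cmd" \
    --feat.online-ivector-dir exp/nnet3/ivectors_${train_set} \
    --trainer.num-epochs 4 \
    --trainer.optimization.num-jobs-initial 2 \
    --trainer.optimization.num-jobs-final 12 \
    --trainer.optimization.initial-effective-lrate 0.0017 \
    --trainer.optimization.final-effective-lrate 0.00017 \
    --feat-dir data/${train_set}_hires \
    --ali-dir $ali_dir \
    --lang data/lang \
    --dir $dir || exit 1;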



Nnet3Eg

How nnet3 examples are saved to disk.

When working with a TDNN or another kind of feed-forward (non-recurrent) neural net, we find the training examples saved to disk in archive files as [data, label] pairs.

You can use something like nnet3-copy-egs ark:egs.1.ark ark,t:- | head to take a quick look at your data in human-readable format.
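
For example (the path to egs.1.ark is just illustrative; the archives normally live under the egs/ subdirectory of your experiment directory):

# Print the start of the first egs archive in text form.
nnet3-copy-egs ark:exp/nnet3/tdnn/egs/egs.1.ark ark,t:- | head -n 40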

Here’s one training example from some librispeech data:

libri-org_19-198-0024-1068 <Nnet3Eg> <NumIo> 2 <NnetIo> input <I1V> 36 <I1> 0 -16 0 <I1> 0 -15 0 <I1> 0 -14 0 <I1> 0 -13 0 <I1> 0 -12 0 <I1> 0 -11 0 <I1> 0 -10 0 <I1> 0 -9 0 <I1> 0 -8 0 <I1> 0 -7 0 <I1> 0 -6 0 <I1> 0 -5 0 <I1> 0 -4 0 <I1> 0 -3 0 <I1> 0 -2 0 <I1> 0 -1 0 <I1> 0 0 0 <I1> 0 1 0 <I1> 0 2 0 <I1> 0 3 0 <I1> 0 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 0 7 0 <I1> 0 8 0 <I1> 0 9 0 <I1> 0 10 0 <I1> 0 11 0 <I1> 0 12 0 <I1> 0 13 0 <I1> 0 14 0 <I1> 0 15 0 <I1> 0 16 0 <I1> 0 17 0 <I1> 0 18 0 <I1> 0 19 0  [
  21.62141 -1.176828 -3.170161 -0.491064 -2.720289 -2.210604 0.3688199 1.155272 -0.03023905 -0.007696561 1.108493 0.08210284 0.2613205 
  21.7017 -1.115339 -3.24841 -0.3867778 -2.922514 -2.095302 0.3688199 1.306337 -0.2039333 -0.04113591 1.032548 0.2306411 0.1930926 
  21.72846 -1.099967 -3.274493 -0.4165739 -3.12474 -1.66292 0.3133322 1.284756 -0.5039507 -0.02415323 0.8300285 0.4892248 0.2112868 
  21.83552 -1.146083 -3.24841 -0.7294323 -3.038072 -1.472113 0.3965637 1.176853 -0.5039507 -0.2279451 0.7034538 0.5092533 0.247675 
  22.04962 -1.289638 -3.170161 -1.161475 -2.893625 -1.256928 0.2301007 1.349498 -0.7807492 -0.06377941 0.2836092 0.6093962 0.256772 
  22.15668 -1.474328 -3.013664 -1.454724 -2.575842 -1.211283 0.2855884 1.371079 -0.8364822 -0.3694672 0.3123612 0.5893676 0.1658015 
  22.18344 -1.505109 -2.909332 -1.777845 -2.172618 -1.341698 0.8127213 0.7236576 0.02502733 -1.455261 0.8047136 0.2306411 0.07483093 
  22.15668 -1.535891 -2.805 -2.10718 -1.83891 -1.17868 0.701746 0.5687855 0.3680729 -1.934681 0.5596279 0.245495 0.1521559 
  22.10315 -1.361462 -2.831083 -2.036757 -1.720258 -0.9439335 0.2301007 0.8099805 0.4093249 -1.985147 0.1053472 0.2752026 0.420519 
  21.88904 -1.238335 -2.805 -2.177604 -1.594191 -0.9634957 -0.1877976 0.9178841 0.8424718 -1.959914 -0.009660542 0.3494717 0.5604483 
  21.59465 -1.099967 -2.648503 -2.388876 -1.453292 -1.09391 -0.3819043 1.047368 0.450577 -1.505726 -0.06716442 0.2752026 0.5433292 
  21.46083 -1.146083 -2.335508 -2.600147 -1.127743 -0.8787263 -0.355131 0.9394647 -0.07761019 -1.329098 -0.1756579 -0.08624053 0.6118056 
  21.75523 -1.161456 -2.465923 -2.494512 -1.403208 -0.7074229 -0.455531 0.9826261 0.001341701 -1.581424 0.1513503 -0.3143125 0.6289247 
  22.10315 -0.9923623 -2.492005 -2.776207 -1.794415 -0.8656848 -0.8301595 1.500563 0.6362113 -1.556191 0.1053472 -0.05653287 0.9370683 
  22.26373 -1.069223 -2.62242 -2.635359 -1.920483 -1.367781 -0.5425444 1.737951 0.429951 -1.329098 0.1398495 0.0722003 0.9541874 
  22.18344 -1.263986 -2.43984 -2.4593 -2.04655 -1.634095 -0.06731755 1.910597 -0.07761019 -1.430028 0.4388697 0.1167618 0.5775674 
  21.83552 -1.494849 -2.257259 -2.142392 -2.269022 -1.66292 0.2855884 1.781112 -0.5671122 -1.329098 0.8553435 -0.2794969 0.5604483 
  21.16644 -1.643627 -1.99643 -1.808915 -2.460284 -1.634095 0.3133322 1.371079 -0.9479482 -0.8749096 1.235068 -0.5057982 0.5387807 
  20.46527 -1.70519 -1.803653 -1.566574 -2.575842 -1.439509 0.04646915 0.7452384 -0.8643488 -0.1317101 1.235068 -0.3839437 0.7316392 
  20.20641 -1.582063 -1.731094 -1.454724 -2.864736 -1.511237 -0.007077515 0.5687855 -0.8364822 0.3708069 1.057863 -0.2969047 0.7145202 
  20.09547 -1.274247 -1.607744 -1.36773 -3.066961 -1.951175 0.05985582 0.8963033 -0.8364822 0.1404135 0.7540837 -0.2794969 0.9028301 
  19.87359 -0.8232691 -1.390066 -1.212383 -3.038072 -2.066477 -0.1944909 1.09053 -1.031548 -0.05811852 0.8806584 -0.4187592 0.8001156 
  19.42984 -0.8540133 -1.549696 -1.042291 -2.806957 -1.807048 -0.4086776 0.9826261 -0.8922153 0.4037202 0.5136248 -0.8887694 0.697401 
  19.134 -1.069223 -1.324763 -0.9380047 -2.575842 -1.53732 -0.7579905 0.9826261 -1.588878 0.8315937 0.6528239 -0.6102449 0.07483093 
  18.9491 -1.130711 -1.41909 -1.161475 -2.460284 -1.45255 -0.4220643 1.198433 -1.839677 0.2885236 0.07084489 -0.766915 0.2795147 
  18.9491 -1.084595 -1.426346 -1.268308 -2.358011 -1.517758 -0.06062418 1.39266 -1.97901 0.1897835 -0.3405299 -0.7843227 0.5114896 
  18.83817 -1.161456 -1.411834 -1.012495 -2.261606 -1.335177 -0.07401091 0.9178841 -1.867543 0.1733268 -0.2745811 -0.9409928 0.6631629 
  18.57931 -1.161456 -1.433602 -0.5357581 -2.105876 -1.387343 -0.8301595 0.7236576 -1.783944 0.6999404 0.03059217 -0.8539539 0.4796498 
  18.76421 -1.130711 -1.375555 -0.7443304 -2.283854 -1.465592 -0.7579905 1.133691 -1.839677 0.5518303 -0.3240427 -0.3143125 0.3932278 
  19.06004 -0.9923623 -1.201413 -1.042291 -2.489173 -1.66292 -0.7098777 1.068949 -1.561012 0.8809637 -0.1189179 -0.01692267 0.1157677 
  19.02306 -0.8693854 -1.092574 -1.116781 -2.720289 -1.691746 -0.676411 1.047368 -1.616745 0.765767 0.5193751 -0.3143125 -0.1719818 
  18.76421 -0.8386412 -1.215925 -0.9678007 -2.720289 -1.413426 -0.926385 1.09053 -1.839677 0.2391535 0.473372 -0.4535748 -0.1719818 
  18.46837 -0.8386412 -1.121598 -1.072087 -2.489173 -1.45255 -0.9985542 1.241595 -1.97901 -0.02415323 0.3411131 -0.4709826 0.08392798 
  18.32045 -0.8386412 -0.9619678 -1.305591 -2.269022 -1.634095 -0.3149709 0.4888108 -2.146209 0.05813013 0.1858526 -0.3839437 0.1976412 
  18.28347 -0.7310364 -1.012759 -1.268308 -2.604731 -1.159118 -0.2614243 -0.07101282 -2.508473 0.2885236 -0.06141403 -0.4709826 0.03844271 
  18.32045 -0.5619432 -1.150621 -1.286949 -2.431395 -1.439509 -0.1543309 -0.4552882 -2.006876 -0.02981412 0.1858526 -0.5057982 -0.0610655 ]
</NnetIo> <NnetIo> output <I1V> 8 <I1> 0 0 0 <I1> 0 1 0 <I1> 0 2 0 <I1> 0 3 0 <I1> 0 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 0 7 0 rows=8 dim=856 [ 524 1 ] dim=856 [ 524 1 ] dim=856 [ 524 1 ] dim=856 [ 178 1 ] dim=856 [ 533 1 ] dim=856 [ 480 1 ] dim=856 [ 480 1 ] dim=856 [ 240 1 ] 

Now let’s break down this big mess.

First we get info about the audio file this example comes from: libri-org_19-198-0024-1068

Then we get <NumIo> 2, which tells us this example holds two I/O components: one NnetIo for the input features and one for the output labels.

We can find how many frames of audio are in this example: <I1V> 36

Then we see a long list of entries like <I1> 0 1 0, one tag for each of our 36 frames, where the middle number is the frame’s position relative to the central frame.

Then come the audio features: 21.62141 -1.176828 -3.170161 -0.491064 -2.720289 -2.210604 0.3688199 1.155272 -0.03023905 -0.007696561 1.108493 0.08210284 0.2613205, with one row of features per frame.

After all the audio features comes info about the labels:

For our 36 audio frames we have only eight labels (<I1V> 8); the extra input frames are the left and right context the network needs for its splicing.

Next, as above, comes information about the relative locations of the labels: <I1> 0 4 0

Given that these labels are one-hot vectors, they are stored as just their indices: dim=856 [ 524 1 ] dim=856 [ 178 1 ]. The dim=856 reminds us how many senones are in the model (and hence how many dimensions the target vectors have). The [ 524 1 ] gives the senone ID and the weight of the posterior. In my case this is trivial: it translates to “the current frame has the label 524 with 100% certainty.”
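
A couple of quick one-liners built on the same idea (again using the illustrative egs path from above): count how many examples are in an archive, and pull out the label entries.

# Each example starts with an <Nnet3Eg> tag, so this counts examples:
nnet3-copy-egs ark:exp/nnet3/tdnn/egs/egs.1.ark ark,t:- | grep -c "<Nnet3Eg>"

# List the one-hot label entries (senone ID plus posterior weight,
# which is 1 here as in the example above):
nnet3-copy-egs ark:exp/nnet3/tdnn/egs/egs.1.ark ark,t:- | grep -o "\[ [0-9]* 1 \]" | head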

Other Documentation

Here’s some documentation I wrote up as a walk-through of a typical Kaldi GMM run.sh script.

Here are some more general Kaldi notes I’ve put together. They cover topics like HCLG and file formats.