Kaldi nnet3 notes
👋 Hi, it’s Josh here. I’m writing you this note in 2021: the world of speech technology has changed dramatically since Kaldi. Before devoting weeks of your time to deploying Kaldi, take a look at 🐸 [Coqui Speech-to-Text][coqui-github]. It takes minutes to deploy an off-the-shelf 🐸 STT model, and it’s [open source on Github][coqui-github]. I’m on the Coqui founding team so I’m admittedly biased. However, you can tell from this blog that I’ve spent years working with Kaldi, so I understand the headaches.
With 🐸 STT, we’ve removed the headaches of Kaldi and streamlined everything for production settings. You can train and deploy state-of-the-art 🐸 Speech-to-Text models in just minutes, not weeks. Check out the [🐸 Model Zoo][coqui-model-zoo] for open, pre-trained models in different languages. Try it out for yourself, and come join our [friendly chatroom][coqui-gitter] 💚
Introduction
In an attempt to be more systematic about tuning my hyperparameters for an nnet3
model, I’ve decided to keep this post as a kind of collection of running notes.
DNN Architecture
Looking through the official example scripts (found in `KALDI/egs/`), here's what I find for typical architectures:
Switchboard Corpus
# Taken from script: KALDI/egs/swbd/s5c/local/nnet3/run_tdnn.sh
input dim=100 name=ivector
input dim=40 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
# the first splicing is moved before the lda layer, so no splicing here
relu-renorm-layer name=tdnn1 dim=1024
relu-renorm-layer name=tdnn2 input=Append(-1,2) dim=1024
relu-renorm-layer name=tdnn3 input=Append(-3,3) dim=1024
relu-renorm-layer name=tdnn4 input=Append(-3,3) dim=1024
relu-renorm-layer name=tdnn5 input=Append(-7,2) dim=1024
relu-renorm-layer name=tdnn6 dim=1024
output-layer name=output input=tdnn6 dim=$num_targets max-change=1.5 presoftmax-scale-file=$dir/configs/presoftmax_prior_scale.vec
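One thing worth noticing in these configs is how the per-layer splice offsets add up to the network's total temporal context. As a rough sanity check (my own throwaway snippet, not a Kaldi tool), you can sum the most negative and most positive offsets across layers:

```python
# Rough sanity check (not a Kaldi tool): a TDNN's total temporal context is
# the sum over layers of the most negative / most positive splice offsets.
splice_indexes = [
    [-2, -1, 0, 1, 2],  # first splice, folded into the lda layer
    [-1, 2],            # tdnn2
    [-3, 3],            # tdnn3
    [-3, 3],            # tdnn4
    [-7, 2],            # tdnn5
    [0],                # tdnn6 (no splicing)
]

left_context = sum(-min(layer) for layer in splice_indexes)
right_context = sum(max(layer) for layer in splice_indexes)
print(f"left context = {left_context}, right context = {right_context}")
# left context = 16, right context = 12
```

So each output frame of this Switchboard network depends on 16 frames of past and 12 frames of future input.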
Wall Street Journal
# Taken from script KALDI/egs/wsj/s5/local/nnet3/run_tdnn.sh
input dim=100 name=ivector
input dim=40 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
# the first splicing is moved before the lda layer, so no splicing here
relu-renorm-layer name=tdnn1 dim=650
relu-renorm-layer name=tdnn2 dim=650 input=Append(-1,0,1)
relu-renorm-layer name=tdnn3 dim=650 input=Append(-1,0,1)
relu-renorm-layer name=tdnn4 dim=650 input=Append(-3,0,3)
relu-renorm-layer name=tdnn5 dim=650 input=Append(-6,-3,0)
output-layer name=output dim=$num_targets max-change=1.5
Tedlium
Some scripts use the `steps/nnet3/tdnn/make_configs.py` script to automatically generate the config file, and they look something like this:
# Taken from script: KALDI/egs/tedlium/s5/local/nnet3/run_tdnn.sh
# create the config files for nnet initialization
python steps/nnet3/tdnn/make_configs.py \
--feat-dir data/${train_set}_hires \
--ivector-dir exp/nnet3/ivectors_${train_set} \
--ali-dir $ali_dir \
--relu-dim 500 \
--splice-indexes "-1,0,1 -1,0,1,2 -3,0,3 -3,0,3 -3,0,3 -6,-3,0" \
--use-presoftmax-prior-scale true \
$dir/configs || exit 1;
Librispeech
# Taken from script: KALDI/egs/librispeech/s5/local/nnet3/run_tdnn.sh
# create the config files for nnet initialization
python steps/nnet3/tdnn/make_configs.py \
--feat-dir $train_data_dir \
--ivector-dir $train_ivector_dir \
--ali-dir $ali_dir \
--relu-dim 1280 \
--splice-indexes "-2,-1,0,1,2 -1,2 -3,3 -7,2 0" \
--use-presoftmax-prior-scale true \
$dir/configs || exit 1;
Fisher Switchboard
# Taken from script: KALDI/egs/fisher_swbd/s5/local/nnet3/run_tdnn.sh
# create the config files for nnet initialization
python steps/nnet3/tdnn/make_configs.py \
--feat-dir data/${train_set}_hires \
--ivector-dir exp/nnet3/ivectors_${train_set} \
--ali-dir $ali_dir \
--relu-dim 1024 \
--splice-indexes "-2,-1,0,1,2 -1,2 -3,3 -3,3 -7,2 0" \
--use-presoftmax-prior-scale true \
$dir/configs || exit 1;
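The `--splice-indexes` argument encodes the same per-layer offsets as the `Append(...)` lines in the hand-written configs above, one space-separated group per layer. A tiny throwaway parser (hypothetical, not part of Kaldi) to see the correspondence:

```python
# Hypothetical helper (not part of Kaldi): turn a --splice-indexes string
# into per-layer offset lists, i.e. the same offsets you would write in the
# Append(...) descriptors of a hand-written config.
def parse_splice_indexes(s):
    return [[int(x) for x in group.split(",")] for group in s.split()]

print(parse_splice_indexes("-2,-1,0,1,2 -1,2 -3,3 -3,3 -7,2 0"))
# [[-2, -1, 0, 1, 2], [-1, 2], [-3, 3], [-3, 3], [-7, 2], [0]]
```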
Learning Rates
babel/s5d/local/chain/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.001 \
babel/s5d/local/chain/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.0001 \
babel/s5d/local/nnet3/run_tdnn.sh: --initial-effective-lrate 0.0017 --final-effective-lrate 0.00017 \
callhome_egyptian/s5/local/nnet3/run_tdnn.sh: --initial-effective-lrate 0.005 --final-effective-lrate 0.0005 \
aspire/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.0017 \
aspire/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.00017 \
librispeech/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.0017 \
librispeech/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.00017 \
iban/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.005 \
iban/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.0005 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.0017 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.00017 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.001 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.0001 \
ami/s5b/local/nnet3/run_tdnn.sh: --initial-effective-lrate 0.0015 --final-effective-lrate 0.00015 \
ami/s5/local/nnet3/run_tdnn.sh: --initial-effective-lrate 0.0015 --final-effective-lrate 0.00015 \
tedlium/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.0015 \
tedlium/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.00015 \
sprakbanken/s5/local/nnet3/run_tdnn.sh: --initial-effective-lrate 0.0015 --final-effective-lrate 0.00015 \
multi_en/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.0017 \
multi_en/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.00017 \
csj/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.initial-effective-lrate 0.0017 \
csj/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.final-effective-lrate 0.00017 \
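A pattern stands out: the final effective rate is almost always the initial rate divided by ten. My understanding (from reading the nnet3 training scripts, so treat this as an approximation rather than a spec) is that the effective learning rate decays exponentially from the initial to the final value over the training iterations, and the rate actually applied in each parallel job is the effective rate scaled by the current number of jobs. A toy sketch:

```python
# Toy sketch of the learning-rate schedule as I understand it (not Kaldi
# code): exponential decay of the effective rate, with the per-job rate
# being the effective rate scaled by the current number of jobs.
def effective_lrate(i, num_iters, initial=0.0017, final=0.00017):
    return initial * (final / initial) ** (i / num_iters)

num_iters = 100
for i, num_jobs in [(0, 3), (50, 9), (100, 16)]:  # made-up job counts
    eff = effective_lrate(i, num_iters)
    print(f"iter {i:3d}: effective {eff:.6f}, per-job {eff * num_jobs:.6f}")
```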
Number of Training Epochs
iban/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 4 \
iban/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 2 \
callhome_egyptian/s5/local/nnet3/run_tdnn.sh: --num-epochs 8 --num-jobs-initial 2 --num-jobs-final 14 \
aspire/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 3 \
tedlium/s5/local/chain/run_tdnn.sh:num_epochs=4
tedlium/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 2 \
sprakbanken/s5/local/nnet3/run_tdnn.sh:num_epochs=3
librispeech/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 4 \
wsj/s5/steps/nnet3/train_tdnn.sh:num_epochs=15 # Number of epochs of training;
wsj/s5/steps/nnet3/chain/train_tdnn.sh:num_epochs=10 # Number of epochs of training;
ami/s5b/local/chain/multi_condition/run_tdnn.sh: --trainer.num-epochs 4 \
ami/s5b/local/nnet3/run_tdnn.sh:num_epochs=3
ami/s5/local/nnet3/run_tdnn.sh:num_epochs=3
hkust/s5/local/chain/run_tdnn.sh:num_epochs=4
hkust/s5/local/nnet3/run_tdnn.sh:num_epochs=4
csj/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 2 \
multi_en/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 4 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh: --trainer.num-epochs 4 \
aishell/s5/local/nnet3/run_tdnn.sh:num_epochs=4
babel/s5d/local/chain/run_tdnn.sh: --trainer.num-epochs 4 \
babel/s5d/local/nnet3/run_tdnn.sh: --num-epochs 5 --num-jobs-initial 2 --num-jobs-final 6 \
fisher_english/s5/local/chain/run_tdnn.sh:num_epochs=4
Number of Jobs Initial / Final
TL;DR
- small data ==> small num jobs
- few GPUs ==> small num jobs
Some insights
Dan Povey's response to a question about optimizing `num-jobs-initial` and `num-jobs-final`:
Q: In training nnet2/nnet3 networks, do the num-jobs-initial/final parameters make any important difference to the result, or is this just an efficiency heuristic to distribute as many jobs as the nnet can deal with (in terms of averaging the parallel jobs results I suppose)? [ David van Leeuwen ]
A: They do make a bit of a difference, but in general fewer is better [except if your system has too many parameters and is overtraining excessively, but the best solution for that is to use fewer epochs or fewer parameters.]
… in general it’s best to use no more jobs than the number of GPUs that you have. And when reducing the num-jobs, it might be a good idea to reduce the number of epochs slightly. [ Dan Povey ]
Another response from Povey on the Kaldi team's views on these parameters:
Q: if I change the value of num-jobs-initial and num-jobs-final : do I need to change also the other parameters like the learning rate? changing the values of num-jobs-initial and num-jobs-final doesn’t affect the accuracy of the training ? That’s why I wonder why the author of run_tdnn.sh has chosen the value “2” and “12” for these parameters, it’s the same question when I see others parameters in kaldi scripts like “$nj” set to a precise value, if we change $nj does it affect the final accuracy and the WER ? [ Mannix ]
A: You don’t have to change the learning rate, those are automatically adjusted in a suitable way. It may affect results, and it may be a good idea to reduce the num-epochs if you reduce the num-jobs, but we normally don’t tune the num-jobs for WER (we choose it based on convenience in training time) so the WER could be better or worse. In general, for non-neural-net scripts and for decoding, the $nj option makes no difference to the results. [ Dan Povey ]
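For intuition about "averaging the parallel jobs results": each job takes SGD steps on its own shard of data starting from the same model, and the resulting models are combined (roughly, averaged) at the end of each iteration. A toy illustration of that idea, not Kaldi code:

```python
import numpy as np

# Toy illustration of parallel SGD with periodic model averaging; this is
# NOT Kaldi code, just the basic idea behind num-jobs-initial/final.
rng = np.random.default_rng(0)

def sgd_steps(params, shard, lrate=0.1):
    # Pretend objective: pull the parameters toward each data point.
    for x in shard:
        params = params - lrate * (params - x)
    return params

num_jobs = 4
shards = [rng.normal(loc=1.0, size=(20, 5)) for _ in range(num_jobs)]
params = np.zeros(5)

for iteration in range(10):
    # Each job starts from the same model and trains on its own shard...
    updated = [sgd_steps(params.copy(), shard) for shard in shards]
    # ...then the jobs' models are averaged to produce the next model.
    params = np.mean(updated, axis=0)

print(params)  # ends up near the overall data mean (~1.0 in each dim)
```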
iban/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-initial 2 \
iban/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-final 4 \
callhome_egyptian/s5/local/nnet3/run_tdnn.sh: --num-epochs 8 --num-jobs-initial 2 --num-jobs-final 14 \
aspire/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-initial 4 \
aspire/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-final 22 \
tedlium/s5/local/chain/run_tdnn.sh:num_jobs_initial=3
tedlium/s5/local/chain/run_tdnn.sh:num_jobs_final=8
tedlium/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-initial 3 \
tedlium/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-final 8 \
sprakbanken/s5/local/nnet3/run_tdnn.sh: --num-epochs $num_epochs --num-jobs-initial 2 --num-jobs-final 12 \
librispeech/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-initial 3 \
librispeech/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-final 16 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh: --trainer.optimization.num-jobs-initial 2 \
ami/s5b/local/chain/multi_condition/run_tdnn.sh: --trainer.optimization.num-jobs-final 12 \
ami/s5b/local/nnet3/run_tdnn.sh: --num-epochs $num_epochs --num-jobs-initial 2 --num-jobs-final 12 \
ami/s5/local/nnet3/run_tdnn.sh: --num-epochs $num_epochs --num-jobs-initial 2 --num-jobs-final 12 \
hkust/s5/local/chain/run_tdnn.sh:num_jobs_initial=2
hkust/s5/local/chain/run_tdnn.sh:num_jobs_final=12
hkust/s5/local/nnet3/run_tdnn.sh:num_jobs_initial=2
hkust/s5/local/nnet3/run_tdnn.sh:num_jobs_final=12
csj/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-initial 1 \
csj/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-final 4 \
multi_en/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-initial 3 \
multi_en/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-final 16 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-initial 3 \
fisher_swbd/s5/local/nnet3/run_tdnn.sh: --trainer.optimization.num-jobs-final 16 \
aishell/s5/local/nnet3/run_tdnn.sh:num_jobs_initial=2
aishell/s5/local/nnet3/run_tdnn.sh:num_jobs_final=12
babel/s5d/local/chain/run_tdnn.sh: --trainer.optimization.num-jobs-initial 2 \
babel/s5d/local/chain/run_tdnn.sh: --trainer.optimization.num-jobs-final 12 \
babel/s5d/local/nnet3/run_tdnn.sh: --num-epochs 5 --num-jobs-initial 2 --num-jobs-final 6 \
fisher_english/s5/local/chain/run_tdnn.sh:num_jobs_initial=3
fisher_english/s5/local/chain/run_tdnn.sh:num_jobs_final=16
Nnet3Eg
How `nnet3` examples are saved to disk.
When working with a TDNN or another kind of feed-forward (non-recurrent) neural net, we will find the training examples saved to disk in archive files, as [data, label] pairs.
You can use something like `nnet3-copy-egs ark:egs.1.ark ark,t:- | head` to take a quick look at your data in human-readable format.
Here's one training example from some `librispeech` data:
libri-org_19-198-0024-1068 <Nnet3Eg> <NumIo> 2 <NnetIo> input <I1V> 36 <I1> 0 -16 0 <I1> 0 -15 0 <I1> 0 -14 0 <I1> 0 -13 0 <I1> 0 -12 0 <I1> 0 -11 0 <I1> 0 -10 0 <I1> 0 -9 0 <I1> 0 -8 0 <I1> 0 -7 0 <I1> 0 -6 0 <I1> 0 -5 0 <I1> 0 -4 0 <I1> 0 -3 0 <I1> 0 -2 0 <I1> 0 -1 0 <I1> 0 0 0 <I1> 0 1 0 <I1> 0 2 0 <I1> 0 3 0 <I1> 0 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 0 7 0 <I1> 0 8 0 <I1> 0 9 0 <I1> 0 10 0 <I1> 0 11 0 <I1> 0 12 0 <I1> 0 13 0 <I1> 0 14 0 <I1> 0 15 0 <I1> 0 16 0 <I1> 0 17 0 <I1> 0 18 0 <I1> 0 19 0 [
21.62141 -1.176828 -3.170161 -0.491064 -2.720289 -2.210604 0.3688199 1.155272 -0.03023905 -0.007696561 1.108493 0.08210284 0.2613205
21.7017 -1.115339 -3.24841 -0.3867778 -2.922514 -2.095302 0.3688199 1.306337 -0.2039333 -0.04113591 1.032548 0.2306411 0.1930926
21.72846 -1.099967 -3.274493 -0.4165739 -3.12474 -1.66292 0.3133322 1.284756 -0.5039507 -0.02415323 0.8300285 0.4892248 0.2112868
21.83552 -1.146083 -3.24841 -0.7294323 -3.038072 -1.472113 0.3965637 1.176853 -0.5039507 -0.2279451 0.7034538 0.5092533 0.247675
22.04962 -1.289638 -3.170161 -1.161475 -2.893625 -1.256928 0.2301007 1.349498 -0.7807492 -0.06377941 0.2836092 0.6093962 0.256772
22.15668 -1.474328 -3.013664 -1.454724 -2.575842 -1.211283 0.2855884 1.371079 -0.8364822 -0.3694672 0.3123612 0.5893676 0.1658015
22.18344 -1.505109 -2.909332 -1.777845 -2.172618 -1.341698 0.8127213 0.7236576 0.02502733 -1.455261 0.8047136 0.2306411 0.07483093
22.15668 -1.535891 -2.805 -2.10718 -1.83891 -1.17868 0.701746 0.5687855 0.3680729 -1.934681 0.5596279 0.245495 0.1521559
22.10315 -1.361462 -2.831083 -2.036757 -1.720258 -0.9439335 0.2301007 0.8099805 0.4093249 -1.985147 0.1053472 0.2752026 0.420519
21.88904 -1.238335 -2.805 -2.177604 -1.594191 -0.9634957 -0.1877976 0.9178841 0.8424718 -1.959914 -0.009660542 0.3494717 0.5604483
21.59465 -1.099967 -2.648503 -2.388876 -1.453292 -1.09391 -0.3819043 1.047368 0.450577 -1.505726 -0.06716442 0.2752026 0.5433292
21.46083 -1.146083 -2.335508 -2.600147 -1.127743 -0.8787263 -0.355131 0.9394647 -0.07761019 -1.329098 -0.1756579 -0.08624053 0.6118056
21.75523 -1.161456 -2.465923 -2.494512 -1.403208 -0.7074229 -0.455531 0.9826261 0.001341701 -1.581424 0.1513503 -0.3143125 0.6289247
22.10315 -0.9923623 -2.492005 -2.776207 -1.794415 -0.8656848 -0.8301595 1.500563 0.6362113 -1.556191 0.1053472 -0.05653287 0.9370683
22.26373 -1.069223 -2.62242 -2.635359 -1.920483 -1.367781 -0.5425444 1.737951 0.429951 -1.329098 0.1398495 0.0722003 0.9541874
22.18344 -1.263986 -2.43984 -2.4593 -2.04655 -1.634095 -0.06731755 1.910597 -0.07761019 -1.430028 0.4388697 0.1167618 0.5775674
21.83552 -1.494849 -2.257259 -2.142392 -2.269022 -1.66292 0.2855884 1.781112 -0.5671122 -1.329098 0.8553435 -0.2794969 0.5604483
21.16644 -1.643627 -1.99643 -1.808915 -2.460284 -1.634095 0.3133322 1.371079 -0.9479482 -0.8749096 1.235068 -0.5057982 0.5387807
20.46527 -1.70519 -1.803653 -1.566574 -2.575842 -1.439509 0.04646915 0.7452384 -0.8643488 -0.1317101 1.235068 -0.3839437 0.7316392
20.20641 -1.582063 -1.731094 -1.454724 -2.864736 -1.511237 -0.007077515 0.5687855 -0.8364822 0.3708069 1.057863 -0.2969047 0.7145202
20.09547 -1.274247 -1.607744 -1.36773 -3.066961 -1.951175 0.05985582 0.8963033 -0.8364822 0.1404135 0.7540837 -0.2794969 0.9028301
19.87359 -0.8232691 -1.390066 -1.212383 -3.038072 -2.066477 -0.1944909 1.09053 -1.031548 -0.05811852 0.8806584 -0.4187592 0.8001156
19.42984 -0.8540133 -1.549696 -1.042291 -2.806957 -1.807048 -0.4086776 0.9826261 -0.8922153 0.4037202 0.5136248 -0.8887694 0.697401
19.134 -1.069223 -1.324763 -0.9380047 -2.575842 -1.53732 -0.7579905 0.9826261 -1.588878 0.8315937 0.6528239 -0.6102449 0.07483093
18.9491 -1.130711 -1.41909 -1.161475 -2.460284 -1.45255 -0.4220643 1.198433 -1.839677 0.2885236 0.07084489 -0.766915 0.2795147
18.9491 -1.084595 -1.426346 -1.268308 -2.358011 -1.517758 -0.06062418 1.39266 -1.97901 0.1897835 -0.3405299 -0.7843227 0.5114896
18.83817 -1.161456 -1.411834 -1.012495 -2.261606 -1.335177 -0.07401091 0.9178841 -1.867543 0.1733268 -0.2745811 -0.9409928 0.6631629
18.57931 -1.161456 -1.433602 -0.5357581 -2.105876 -1.387343 -0.8301595 0.7236576 -1.783944 0.6999404 0.03059217 -0.8539539 0.4796498
18.76421 -1.130711 -1.375555 -0.7443304 -2.283854 -1.465592 -0.7579905 1.133691 -1.839677 0.5518303 -0.3240427 -0.3143125 0.3932278
19.06004 -0.9923623 -1.201413 -1.042291 -2.489173 -1.66292 -0.7098777 1.068949 -1.561012 0.8809637 -0.1189179 -0.01692267 0.1157677
19.02306 -0.8693854 -1.092574 -1.116781 -2.720289 -1.691746 -0.676411 1.047368 -1.616745 0.765767 0.5193751 -0.3143125 -0.1719818
18.76421 -0.8386412 -1.215925 -0.9678007 -2.720289 -1.413426 -0.926385 1.09053 -1.839677 0.2391535 0.473372 -0.4535748 -0.1719818
18.46837 -0.8386412 -1.121598 -1.072087 -2.489173 -1.45255 -0.9985542 1.241595 -1.97901 -0.02415323 0.3411131 -0.4709826 0.08392798
18.32045 -0.8386412 -0.9619678 -1.305591 -2.269022 -1.634095 -0.3149709 0.4888108 -2.146209 0.05813013 0.1858526 -0.3839437 0.1976412
18.28347 -0.7310364 -1.012759 -1.268308 -2.604731 -1.159118 -0.2614243 -0.07101282 -2.508473 0.2885236 -0.06141403 -0.4709826 0.03844271
18.32045 -0.5619432 -1.150621 -1.286949 -2.431395 -1.439509 -0.1543309 -0.4552882 -2.006876 -0.02981412 0.1858526 -0.5057982 -0.0610655 ]
</NnetIo> <NnetIo> output <I1V> 8 <I1> 0 0 0 <I1> 0 1 0 <I1> 0 2 0 <I1> 0 3 0 <I1> 0 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 0 7 0 rows=8 dim=856 [ 524 1 ] dim=856 [ 524 1 ] dim=856 [ 524 1 ] dim=856 [ 178 1 ] dim=856 [ 533 1 ] dim=856 [ 480 1 ] dim=856 [ 480 1 ] dim=856 [ 240 1 ]
Now let’s break down this big mess.
First we get info about the audio file this example comes from: `libri-org_19-198-0024-1068`.
Then we get `<NumIo> 2`, which tells us this example contains two I/O blocks: one `<NnetIo>` for the input features and one for the output labels.
We can see how many frames of audio are in this example: `<I1V> 36`.
Then we see a long list of index tags like `<I1> 0 1 0`, one for each of our 36 frames, where the middle number is the frame's time index relative to the central frame.
Then come the audio features, one row per frame:
`21.62141 -1.176828 -3.170161 -0.491064 -2.720289 -2.210604 0.3688199 1.155272 -0.03023905 -0.007696561 1.108493 0.08210284 0.2613205`
After all the audio features comes info about the labels:
For our 36 audio frames we have only eight labels (`<I1V> 8`); this has to do with splicing: the extra input frames are left/right context around the eight frames that actually get supervised.
Next comes, as above, information about the relative location of each label: `<I1> 0 4 0`.
Given that these labels are one-hot vectors, they are represented as just their indices: `dim=856 [ 524 1 ] dim=856 [ 178 1 ]`.
The `dim=856` reminds us how many senones are in the model (and hence how many dimensions each target vector has). The `[ 524 1 ]` gives the senone ID and the weight of the posterior. In my case this is trivial, because it translates to "the current frame has label 524 with 100% certainty."
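As a cross-check, the amount of context baked into this example can be read straight off the `<I1>` time indices: the input frames cover t = -16 … 19 while the supervised output frames cover t = 0 … 7. A quick snippet (mine, not a Kaldi tool):

```python
# Recover the context of the example above from its <I1> time indices.
input_t = range(-16, 20)   # the 36 input frames, t = -16 .. 19
output_t = range(0, 8)     # the 8 supervised frames, t = 0 .. 7

left_context = min(output_t) - min(input_t)    # 16
right_context = max(input_t) - max(output_t)   # 12
frames_per_eg = len(output_t)                  # 8

print(left_context, right_context, frames_per_eg)
```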
Other Documentation
Here’s some documentation I wrote up as a walk-through of a typical Kaldi GMM run.sh script.
Here are some more general Kaldi notes I've put together; they cover topics like HCLG and file formats.