Getting started with the Merlin Speech Synthesis Toolkit
Introduction
This post is a short introduction to installing and using the Merlin Speech Synthesis toolkit.
In the following, I will display all the commands needed to (1) install Merlin from the official GitHub repository and (2) run the included demo.
In addition to showing and explaining the commands needed to install and run Merlin, I will also take some time to show how the scripts work and dive into the file structure expected by Merlin. However, I won’t get into any of the algorithms behind DNNs or speech synthesis.
More specifically, I will show which files are required by scripts, which files are generated by scripts, and how the main demo script `run_demo.sh` proceeds from data preparation to training to synthesis.
Here is a nice, concise description of the toolkit quoted directly from the official CSTR Merlin site:
Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).
The system is written in Python and relies on the Theano numerical computation library.
Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the art systems.
Merlin is free software, distributed under an Apache License Version 2.0, allowing unrestricted commercial and non-commercial use alike.
Installation
Clone
Like good open-source software, the Merlin toolkit is hosted on GitHub and can be easily downloaded (cloned) with a single line of code:
josh@yoga:~/git$ git clone https://github.com/CSTR-Edinburgh/merlin.git
Cloning into 'merlin'...
remote: Counting objects: 1515, done.
remote: Total 1515 (delta 0), reused 0 (delta 0), pack-reused 1514
Receiving objects: 100% (1515/1515), 5.06 MiB | 682.00 KiB/s, done.
Resolving deltas: 100% (734/734), done.
Checking connectivity... done.
Easy as that! Now let’s take a peek into what we just downloaded.
josh@yoga:~/git$ cd merlin/
josh@yoga:~/git/merlin$ ls
COPYING CREDITS.md doc egs INSTALL LICENSE misc README.md src test tools
We can immediately see some familiar file and directory names, like `doc` for documentation, `egs` for examples, `src` for source code, etc.
First things first, we should always start investigating new code with the `README` file:
josh@yoga:~/git/merlin$ cat README.md
The Neural Network (NN) based Speech Synthesis System
=====================================================
This repository contains the Neural Network (NN) based Speech Synthesis System
developed at the Centre for Speech Technology Research (CSTR), University of
Edinburgh.
To build the toolkit: see `./INSTALL`. These instructions are valid for UNIX
systems including various flavors of Linux;
To run the example system builds, see `egs/README.txt`
As a first demo, please follow the scripts in `egs/slt_arctic`
Synthetic speech samples
------------------------
Listen to [synthetic speech samples](https://cstr-edinburgh.github.io/merlin/demo.html) from our demo voice.
Development pattern for contributors
------------------------------------
1. [Create a personal fork](https://help.github.com/articles/fork-a-repo/)
of the [main Merlin repository](https://github.com/CSTR-Edinburgh/merlin) in GitHub.
2. Make your changes in a named branch different from `master`, e.g. you create
a branch `my-new-feature`.
3. [Generate a pull request](https://help.github.com/articles/creating-a-pull-request/)
through the Web interface of GitHub.
Contact Us
----------
Post your questions, suggestions, and discussions to [GitHub Issues](https://github.com/CSTR-Edinburgh/merlin/issues).
Citation
--------
If you publish work based on Merlin, please cite:
Zhizheng Wu, Oliver Watts, Simon King, "Merlin: An Open Source Neural Network Speech Synthesis System" in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), September 2016, Sunnyvale, CA, USA.
Srikanth Ronanki, Zhizheng Wu, Oliver Watts, Simon King, "A Demonstration of the Merlin Open Source Neural Network Speech Synthesis System" in Proc. special demo session, 9th ISCA Speech Synthesis Workshop (SSW9), 2016, Sunnyvale, CA, USA.
This main `README` file gives us information on the toolkit, how to install it, how to work with it, and how to contact the team from the University of Edinburgh. For our current purposes, since we're just trying to install and run the demo, let's take a look into the `INSTALL` file.
josh@yoga:~/git/merlin$ cat ./INSTALL
INSTALL
=======
(1) go to tools/ and follow INSTALL instructions there.
(2) Merlin is coded in python and need third-party python libraries such as:
numpy, scipy, matplotlib, lxml
Usually shipped with your python packages
Available in Ubuntu packages
theano
Can be found on pip
Need version 0.6 and above
http://deeplearning.net/software/theano/
bandmat
Can be found on pip
https://pypi.python.org/pypi/bandmat
For running on NVIDIA GPU, you will need also CUDA
https://developer.nvidia.com/cuda-zone
and you might want also CUDNN [optionnal]
https://developer.nvidia.com/cudnn
Computationnal efficiency is obviously greatly improved using GPU.
It is also improved using the latest versions of theano.
So, we find that to install Merlin successfully, we have to:
- Follow the instructions in the `tools/INSTALL` file
- Install some Python dependencies
Compile
First we `cd` into the `tools` dir:
josh@yoga:~/git/merlin$ cd tools/
josh@yoga:~/git/merlin/tools$ ls
compile_tools.sh INSTALL WORLD WORLD_v2
Then we take a look into the new `INSTALL` file we find here:
josh@yoga:~/git/merlin/tools$ cat ./INSTALL
INSTALL
=======
./compile_tools.sh
Simple as that! We just have to run that one `compile_tools.sh` script and we should be good to go (apart from Python dependencies).
So, when we run that script, we see something like this:
josh@yoga:~/git/merlin/tools$ ./compile_tools.sh
downloading SPTK-3.9...
--2017-02-14 10:31:33-- http://downloads.sourceforge.net/sp-tk/SPTK-3.9.tar.gz
Resolving downloads.sourceforge.net (downloads.sourceforge.net)… 216.34.181.59
Connecting to downloads.sourceforge.net (downloads.sourceforge.net)|216.34.181.59|:80… connected.
HTTP request sent, awaiting response… 301 Moved Permanently
Location: http://downloads.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz [following]
--2017-02-14 10:31:34-- http://downloads.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz
Reusing existing connection to downloads.sourceforge.net:80.
HTTP request sent, awaiting response… 302 Moved Temporarily
Location: https://netix.dl.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz [following]
--2017-02-14 10:31:34-- https://netix.dl.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz
Resolving netix.dl.sourceforge.net (netix.dl.sourceforge.net)… 87.121.121.2
Connecting to netix.dl.sourceforge.net (netix.dl.sourceforge.net)|87.121.121.2|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 1077702 (1.0M) [application/x-gzip]
Saving to: ‘SPTK-3.9.tar.gz.1’
100%[===========================================================================================================================================================================>] 1,077,702 262KB/s in 4.0s
2017-02-14 10:31:39 (262 KB/s) - ‘SPTK-3.9.tar.gz.1’ saved [1077702/1077702]
compiling SPTK...
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking for style of include used by make... GNU
.
.
.
test/audioio.cpp:92:38: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(&data_check[1], 1, 3, fp);
^
test/audioio.cpp:104:34: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(for_int_number, 1, 4, fp); // "data"
^
mkdir -p ./build/objs/test
g++ -O1 -g -Wall -fPIC -Isrc -o "build/objs/test/analysis.o" -c "test/analysis.cpp"
g++ -O1 -g -Wall -fPIC -o ./build/analysis ./build/objs/test/audioio.o ./build/objs/test/analysis.o ./build/libworld.a -lm
mkdir -p ./build/objs/test
g++ -O1 -g -Wall -fPIC -Isrc -o "build/objs/test/synth.o" -c "test/synth.cpp"
test/synth.cpp: In function ‘void {anonymous}::WaveformSynthesis(WorldParameters*, int, int, double*)’:
test/synth.cpp:240:9: warning: variable ‘elapsed_time’ set but not used [-Wunused-but-set-variable]
DWORD elapsed_time;
^
test/synth.cpp: In function ‘int main(int, char**)’:
test/synth.cpp:313:55: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(&world_parameters.f0[i], sizeof(double), 1, fp);
^
test/synth.cpp:332:72: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(&world_parameters.spectrogram[i][j], sizeof(double), 1, fp);
^
test/synth.cpp:341:65: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(&coarse_aperiodicities[i][j], sizeof(double), 1, fp);
^
g++ -O1 -g -Wall -fPIC -o ./build/synth ./build/objs/test/audioio.o ./build/objs/test/synth.o ./build/libworld.a -lm
Removing all temporary binaries...
Done.
All tools successfully compiled!!
Since the output to the terminal was pretty long, I cut out a big part in the middle as you can see.
Install Python Dependencies
So now that we have compiled our Merlin tools, let's move on to the second step in the main `merlin/INSTALL` file, where we are told to make sure we have the right Python dependencies installed. It's not explicit, but I'm pretty sure we need Python 2 and not Python 3. In any case, I've tested this out with Python 2 on my system and it seems to work OK.
So, let's get all the dependencies in one fell swoop with the trusty `pip` program. This program helps us install Python packages securely and easily. You probably already have it installed, but if not, follow the simple instructions here.
We can install all the dependencies with one command:
josh@yoga:~/git/merlin$ pip install numpy scipy matplotlib lxml theano bandmat
Since I had all these packages installed already, I get the following messages to the terminal:
josh@yoga:~$ pip install numpy scipy matplotlib lxml theano bandmat
Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): scipy in /usr/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /usr/lib/pymodules/python2.7
Requirement already satisfied (use --upgrade to upgrade): lxml in /usr/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): theano in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): bandmat in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): tornado in /usr/local/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): pyparsing>=1.5.6 in /usr/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): nose in /usr/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): six>=1.9.0 in /usr/local/lib/python2.7/dist-packages (from theano)
Requirement already satisfied (use --upgrade to upgrade): backports.ssl-match-hostname in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Requirement already satisfied (use --upgrade to upgrade): singledispatch in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Requirement already satisfied (use --upgrade to upgrade): certifi in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Requirement already satisfied (use --upgrade to upgrade): backports-abc>=0.4 in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Cleaning up...
At this point, we have downloaded and compiled Merlin along with all its necessary dependencies!
Now let's go on to run the demo.
Run the Demo
As per the main `merlin/README.md` file, we are invited to begin our demo with the `slt_arctic` example. So, let's `cd` into that directory and take a look around.
josh@yoga:~/git/merlin$ cd egs/slt_arctic/s1/
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ls
conf merlin_synthesis.sh README.md RESULTS.md run_demo.sh run_full_voice.sh scripts testrefs
First things first, let's look at what the `README` has to tell us about this example.
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ cat README.md
Demo voice
----------
To run demo voice, please follow below steps:
Step 1: git clone https://github.com/CSTR-Edinburgh/merlin.git <br/>
Step 2: cd merlin/egs/slt_arctic/s1 <br/>
Step 3: ./run_demo.sh
Demo voice trains only on 50 utterances and shouldnt take more than 5 min.
Compare the results in log files to baseline results from demo data in [RESULTS.md](https://github.com/CSTR-Edinburgh/merlin/blob/master/egs/slt_arctic/s1/RESULTS.md)
Full voice
----------
To run full voice, please follow below steps:
Step 1: git clone https://github.com/CSTR-Edinburgh/merlin.git <br/>
Step 2: cd merlin/egs/slt_arctic/s1 <br/>
Step 3: ./run_full_voice.sh
Full voice utilizes the whole arctic data (1132 utterances). The training of the voice approximately takes 1 to 2 hours.
Compare the results in log files to baseline results from full data in [RESULTS.md](https://github.com/CSTR-Edinburgh/merlin/blob/master/egs/slt_arctic/s1/RESULTS.md)
Generate new sentences
----------------------
To generate new sentences, please follow below steps:
Step 1: Run either demo voice or full voice. <br/>
Step 2: ./merlin_synthesis.sh
So, we're going to start with the simpler demo voice, as suggested.
Pre-existing Dirs & Files
Before we just go and run that `run_demo.sh`, let's investigate what data and scripts we have so far, so that we can get an idea of what Merlin requires and what the workflow is.
I always like to use the `tree` program for Linux for looking at file structures. It's a very simple program that lists the contents of a directory recursively.
When we look at the file structure of the `slt_arctic/s1` dir, before we've run any scripts, we see this:
josh@yoga:~/git/merlin/egs/slt_arctic$ tree s1/
s1/
├── conf
│ ├── dnn
│ │ ├── acoustic_slt_arctic_full.conf
│ │ ├── duration_slt_arctic_full.conf
│ │ ├── test_dur_synth_slt_arctic_full.conf
│ │ └── test_synth_slt_arctic_full.conf
│ └── logging_config.conf
├── merlin_synthesis.sh
├── README.md
├── RESULTS.md
├── run_demo.sh
├── run_full_voice.sh
├── scripts
│ ├── prepare_config_files_for_synthesis.sh
│ ├── prepare_config_files.sh
│ ├── prepare_labels_from_txt.sh
│ ├── remove_intermediate_files.sh
│ ├── setup.sh
│ └── submit.sh
└── testrefs
├── slt_arctic_demo
│ ├── acoustic_model
│ │ └── log
│ │ └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_04_59PM_August_23_2016.log
│ └── duration_model
│ └── log
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_04_57PM_August_23_2016.log
└── slt_arctic_full
├── acoustic_model
│ └── log
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_1000_259_4_512_0.002000_08_47PM_August_30_2016.log
└── duration_model
└── log
└── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_1000_259_4_512_0.002000_08_44PM_August_30_2016.log
14 directories, 20 files
So, what should we take away from the files and directories present?
conf
The first dir we find in `slt_arctic/s1` is `conf` (short for configuration).
The `conf` dir contains one configuration file for logging (`logging_config.conf`), as well as one dir (`dnn`) which contains DNN configuration files.
In the `conf/dnn` dir, the DNN "conf" files define some information about the paths to relevant directories, information about the training data, and the architecture of the DNNs we want to train.
There are a total of four (4) DNN configuration files, the first two for training and the last two for testing:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/conf$ tree dnn/
dnn/
├── acoustic_slt_arctic_full.conf
├── duration_slt_arctic_full.conf
├── test_dur_synth_slt_arctic_full.conf
└── test_synth_slt_arctic_full.conf
0 directories, 4 files
In Merlin, we don't just model the acoustics of phonemes of the language. We also model their durations. For both (1) acoustic modeling and (2) duration modeling, we use DNNs, and as such we have two configuration files, as seen in the `conf/dnn` dir above.
Quoting from the team’s demonstration paper, they concisely describe the duration model as such:
Duration modelling Merlin models duration using a separate DNN to the acoustic model. The duration model is trained on the aligned data, to predict phone- and/or state-level durations. At synthesis time, duration is predicted first, and is used as an input to the acoustic model to predict the speech parameters.
As such, we need two DNNs, and that means we need two configuration files.
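Since these `.conf` files drive everything that follows, it can be handy to skim them programmatically rather than opening each one by hand. As far as I can tell they are plain INI-style files, so Python's stock `ConfigParser` can walk them. Here's a minimal sketch under that assumption; the filename is just one of the four listed above, and `raw=True` keeps the parser from trying to resolve any interpolation placeholders the file may contain:

```python
# Sketch: list every section and key/value pair in one of the DNN conf files.
# Assumes the file parses as plain INI, which is how it appears to be laid out.
import ConfigParser  # Python 2 module name, matching Merlin's Python 2 setup

config = ConfigParser.ConfigParser()
config.read("conf/dnn/duration_slt_arctic_full.conf")

for section in config.sections():
    print("[%s]" % section)
    for key, value in config.items(section, raw=True):
        print("  %s = %s" % (key, value))
```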
scripts
Moving on to the second directory within `s1`, we find the location of our data preparation scripts. Logically, this directory is labeled `scripts`.
Let’s take a look inside:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree scripts/
scripts/
├── prepare_config_files_for_synthesis.sh
├── prepare_config_files.sh
├── prepare_labels_from_txt.sh
├── remove_intermediate_files.sh
├── setup.sh
└── submit.sh
0 directories, 6 files
The first four data preparation scripts in the `scripts` dir have very transparent filenames, so I won't elaborate on them.
However, we should note that the main purpose of the `setup.sh` script is to:
- download the demo data
- create the main directories to house the current experiment
- move test and train data into those directories
- define the global configuration file
Moving on, the last script located in the `s1/scripts/` dir is the `submit.sh` script. The name is transparent once you know that this script will take any Theano job and submit it to either a GPU or CPU, depending on what you have available.
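To make that idea concrete, here's a rough Python sketch of the kind of dispatch `submit.sh` performs. This is an illustration, not Merlin's actual logic: I'm assuming a GPU counts as "available" when `nvidia-smi` is on the `PATH`, and I'm using Theano's real `THEANO_FLAGS` environment variable to select the device:

```python
# Sketch of GPU/CPU job dispatch in the spirit of submit.sh (not its real code).
import os
import subprocess
from distutils.spawn import find_executable  # Python 2-friendly 'which'

def submit(script, conf):
    # Pick the Theano device: GPU if the NVIDIA toolchain is visible, else CPU.
    if find_executable("nvidia-smi"):
        flags = "mode=FAST_RUN,device=gpu,floatX=float32"
    else:
        flags = "mode=FAST_RUN,device=cpu,floatX=float32"
    env = dict(os.environ, THEANO_FLAGS=flags)
    subprocess.call(["python", script, conf], env=env)

submit("src/run_merlin.py", "conf/duration_slt_arctic_demo.conf")
```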
testrefs
Moving back up a level to the `s1/` dir, the last of the three main dirs is `testrefs`. This dir contains only four (4) files, which are all log files from training performed by the CSTR team. These files can be used to compare against our own training in case we hit any problems.
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree testrefs/
testrefs/
├── slt_arctic_demo
│ ├── acoustic_model
│ │ └── log
│ │ └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_04_59PM_August_23_2016.log
│ └── duration_model
│ └── log
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_04_57PM_August_23_2016.log
└── slt_arctic_full
├── acoustic_model
│ └── log
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_1000_259_4_512_0.002000_08_47PM_August_30_2016.log
└── duration_model
└── log
└── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_1000_259_4_512_0.002000_08_44PM_August_30_2016.log
6 directories, 4 files
At this point, we've gone over the content in each of the three dirs in the first level of the main `s1` dir.
Specifically, we mentioned:
- `s1/conf`: contains configuration files for building, training, and testing our DNNs
- `s1/scripts`: contains scripts for preparing data and submitting Theano jobs
- `s1/testrefs`: contains log files from the CSTR team for our reference
Run the `run_demo.sh` Script
Now that we’ve gone over our pre-existing dirs, let’s go on to the run script and start running the demo.
The `s1/README.md` file directs us to run the `run_demo.sh` script first, so I will start there.
So, walking through the `run_demo.sh` script, the first thing we see is three scripts in the data prep stage:
### Step 1: setup directories and the training data files ###
echo "Step 1: setting up experiments directory and the training data files..."
global_config_file=conf/global_settings.cfg
./scripts/setup.sh slt_arctic_demo
./scripts/prepare_config_files.sh $global_config_file
./scripts/prepare_config_files_for_synthesis.sh $global_config_file
If we take the time to run these scripts one-by-one, we can more easily see what’s going on. I’m going to run each script here, and look into what’s being output by each one.
Prepare Data & Create Config File
`setup.sh` does two important things:
- download, unzip, and move training and testing data into a new `experiments` dir
- create the global configuration file, `global_settings.cfg`, in our `conf` dir
So, beginning by running only the `setup.sh` script (from within `run_demo.sh`), we get the following output to the terminal:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh
Step 1: setting up experiments directory and the training data files...
downloading data.....
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12.9M 100 12.9M 0 0 2856k 0 0:00:04 0:00:04 --:--:-- 3181k
unzipping files......
data is ready!
Merlin default voice settings configured in conf/global_settings.cfg
setup done...!
After running `setup.sh`, if we look into the file structure of the `s1` dir (for just the first level, hence the `-L 1` flag), we can see we've created a few new things.
I've put asterisks (**) around the newly created dirs and file:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree -L 1 .
.
├── conf
├── **experiments**
├── merlin_synthesis.sh
├── README.md
├── RESULTS.md
├── run_demo.sh
├── run_full_voice.sh
├── scripts
├── **slt_arctic_demo_data**
├── **slt_arctic_demo_data.zip**
└── testrefs
5 directories, 6 files
Specifically, we have created the following:
- the `experiments` dir
- the `slt_arctic_demo_data` dir
- the `slt_arctic_demo_data.zip` compressed file
What happened was the following:
- the relevant data files were downloaded from Srikanth Ronanki's homepage, hence the `slt_arctic_demo_data.zip` file
- this compressed file was uncompressed and saved as `slt_arctic_demo_data`
- the relevant files were copied from `slt_arctic_demo_data` into the new `experiments` directory
As such, at this point the `experiments` dir contains mostly just extracted audio feature files.
Here's what we find in this new `experiments` dir:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree experiments/
experiments/
└── slt_arctic_demo
├── acoustic_model
│ └── data
│ ├── bap
│ │ └── arctic_a0001.bap
│ ├── file_id_list_demo.scp
│ ├── label_phone_align
│ │ └── arctic_a0001.lab
│ ├── label_state_align
│ │ └── arctic_a0001.lab
│ ├── lf0
│ │ └── arctic_a0001.lf0
│ └── mgc
│ └── arctic_a0001.mgc
├── duration_model
│ └── data
│ ├── file_id_list_demo.scp
│ ├── label_phone_align
│ │ └── arctic_a0001.lab
│ └── label_state_align
│ └── arctic_a0001.lab
└── test_synthesis
├── prompt-lab
│ └── arctic_a0001.lab
└── test_id_list.scp
14 directories, 433 files
In the above output I've omitted displaying all feature files except for `arctic_a0001`, because there are a lot of them. Specifically, there are 430 feature files in total.
Since there's a lot of stuff going on in the `experiments` dir, I think it's worth the time to briefly explain what we have downloaded in terms of audio feature, label, and utterance ID files.
File Formats in the `experiments` Dir
In terms of the file formats, we find the following:
- `*.bap`: band aperiodicities
- `*.lab`: label files (time-to-phone alignments)
- `*.lf0`: log-fundamental frequencies
- `*.mgc`: mel-generalized cepstral coefficients
- `*.scp`: script files (lists of filenames)
*.bap
The first file type, `*.bap`, is a kind of feature extracted from the audio, and we have one file for every audio file in our data set. We can look into the `*.bap` file itself with the help of the `x2x` program, which is included in Merlin via SPTK (the Speech Signal Processing Toolkit). In the following command we use `+fa` to convert our binary floats ("f") to ASCII ("a").
The only reason I'm using `head` and `tail` here is to get a section of the file which has numbers other than zeros, so it's more interesting.
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/bap$ ../../../../../../../../tools/bin/SPTK-3.9/x2x +fa arctic_a0001.bap | head -50 | tail -10
0
0
0
0
-1.74326
-1.80508
-7.27617
-6.94787
-3.88911
-5.94744
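If you'd rather poke at these feature files from Python than through `x2x`, that works too: as the `+fa` flag suggests, the files appear to be headerless streams of 32-bit floats. Here's a minimal numpy equivalent of the command above, assuming you run it from the same `bap` directory:

```python
# Sketch: read a headerless binary float32 feature file, like x2x +fa does.
import numpy as np

bap = np.fromfile("arctic_a0001.bap", dtype=np.float32)
print("%d values in total" % bap.size)
print(bap[40:50])  # the same slice that head -50 | tail -10 printed above
```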
*.lab
The second file type in the `experiments` dir is `*.lab`. These files are the "label" files which contain alignments for either phones or states to our audio files from our data set. We have two kinds of files here: (1) for phoneme alignments, and (2) for state alignments.
Quoting from email communication with Srikanth Ronanki (one of the Merlin developers):
To train a voice using Merlin, you either require phone-level or state-level alignments but not both. This option is configurable in `global_settings.cfg`. Only very few international languages have enough resources to derive state-level alignments using HMMs – therefore Merlin is configured to even run with phone-level alignments which can be manually-labelled. But, the default demo runs using state-level alignments.
First, for phoneme alignments, we see something like this:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/label_phone_align$ head arctic_a0001.lab
0 2050000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2
2050000 3400000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
3400000 4650000 sil^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
4650000 5950000 ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$1-3!1-1;1-3|er/C:1+0+2/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
5950000 6650000 th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$1-3!2-2;2-2|ah/C:0+0+2/D:content_2/E:in+1@2+4&2+2#1+2/F:det_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
6650000 7650000 er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$1-3!2-2;2-2|ah/C:0+0+2/D:content_2/E:in+1@2+4&2+2#1+2/F:det_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
7650000 8200000 ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$1-3!1-1;3-1|ax/C:1+1+4/D:in_1/E:det+1@3+3&2+2#2+1/F:content_2/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
8200000 8500000 v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$1-3!1-1;3-1|ax/C:1+1+4/D:in_1/E:det+1@3+3&2+2#2+1/F:content_2/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
8500000 9450000 dh^ax-d+ey=n@1_4/A:0_0_2/B:1-1-4@1-2&5-3#2-2$1-2!2-2;4-2|ey/C:0+0+1/D:det_1/E:content+2@4+2&2+1#3+1/F:content_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
9450000 10450000 ax^d-ey+n=jh@2_3/A:0_0_2/B:1-1-4@1-2&5-3#2-2$1-2!2-2;4-2|ey/C:0+0+1/D:det_1/E:content+2@4+2&2+1#3+1/F:content_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
For state-level alignments, we get the following:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/label_state_align$ head arctic_a0001.lab
0 50000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[2]
50000 100000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[3]
100000 150000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[4]
150000 1700000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[5]
1700000 2050000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[6]
2050000 2400000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[2]
2400000 2550000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[3]
2550000 2650000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[4]
2650000 2700000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[5]
2700000 3400000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[6]
For a little sanity check, we can see that the two `*.lab` files for the same audio file have a different number of lines.
Each line in a `*.lab` file represents a single alignment in time. As such, we would expect that for a given audio file, its state alignment `.lab` file will have more lines than its phoneme `.lab` file. This is because one phoneme is made up of multiple states in HMM-based ASR (i.e., the machinery used to generate the alignments).
We'd have to look more into the details of the acoustic model used to do the forced alignment that generated the labels, but the difference between the number of lines in the two alignment files is around what we'd expect:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data$ wc -l label_phone_align/arctic_a0001.lab
37 label_phone_align/arctic_a0001.lab
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data$ wc -l label_state_align/arctic_a0001.lab
185 label_state_align/arctic_a0001.lab
In this case, for the utterance `arctic_a0001` we have a total of 37 phonemes and 185 states. This comes out to 185/37 = 5 states per phoneme.
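We can do the same sanity check in Python. Here's a small sketch, run from the `acoustic_model/data` dir, that counts the alignment lines and also converts the timestamps into seconds; the start/end columns are in HTK-style units of 100 nanoseconds (1e-7 s):

```python
# Sketch: compare phone- and state-level label files for one utterance.
def read_lab(path):
    # each line: <start> <end> <full-context label>, whitespace-separated
    with open(path) as f:
        return [line.split(None, 2) for line in f if line.strip()]

phones = read_lab("label_phone_align/arctic_a0001.lab")
states = read_lab("label_state_align/arctic_a0001.lab")

print("%d phones, %d states" % (len(phones), len(states)))           # 37, 185
print("%.1f states per phone" % (float(len(states)) / len(phones)))  # 5.0

# HTK timestamps are in 100 ns units, so 2050000 * 1e-7 = 0.205 seconds
start, end = int(phones[0][0]), int(phones[0][1])
print("first phone lasts %.3f seconds" % ((end - start) * 1e-7))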
*.lf0
The `*.lf0` files are the log-fundamental frequency files, a.k.a. another kind of feature file extracted from our audio files in our data set.
Again here, we're using the `x2x` program to get something human-readable from the feature file:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/lf0$ ../../../../../../../../tools/bin/SPTK-3.9/x2x +fa arctic_a0001.lf0 | head -50 | tail -10
-1e+10
-1e+10
-1e+10
-1e+10
5.63102
5.70239
5.6373
5.46527
5.46528
5.45657
I’ve deleted a bunch of empty lines in the above output so as not to take up so much space.
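Those `-1e+10` values aren't empty data, by the way: as far as I can tell, that huge negative number is the marker for unvoiced frames, where there is no fundamental frequency to report. Here's a small sketch that masks them out and undoes the natural log to get F0 back in Hz (run from the `lf0` dir):

```python
# Sketch: separate voiced from unvoiced frames in an lf0 feature file.
import numpy as np

lf0 = np.fromfile("arctic_a0001.lf0", dtype=np.float32)
voiced = lf0 > -1e9            # anything above the huge negative marker
f0_hz = np.exp(lf0[voiced])    # undo the natural log

print("%d of %d frames are voiced" % (voiced.sum(), lf0.size))
print("mean F0: %.1f Hz" % f0_hz.mean())  # e.g. exp(5.63) is roughly 280 Hz
```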
*.mgc
Next, we move on to our next feature file type: `*.mgc`. These files contain the mel-generalized cepstral coefficients for our audio files in our data set. Again, this is not very human-readable:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/mgc$ ../../../../../../../../tools/bin/SPTK-3.9/x2x +fa arctic_a0001.mgc | head -50 | tail -10
-0.0696582
-0.0268611
0.0244125
0.038228
-0.0542025
0.0157833
0.0227506
-0.0430836
0.0407014
-0.0229975
*.scp
Moving on to the next file type: the `*.scp` files here, I'm guessing, are similar to Kaldi's `*.scp` files. These files are "script" files which typically contain lists of information. In this case, the `experiments` dir contains two `*.scp` files which contain lists of file IDs. For example:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data$ head file_id_list_demo.scp
arctic_a0001
arctic_a0002
arctic_a0003
arctic_a0004
arctic_a0005
arctic_a0006
arctic_a0007
arctic_a0008
arctic_a0009
arctic_a0010
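One utterance ID per line, nothing more. The scripts presumably join these IDs with a directory and an extension to locate each utterance's files; here's a quick sketch of that pattern, with directory names taken from the `acoustic_model/data` layout shown earlier:

```python
# Sketch: map each utterance ID in the scp list onto its per-stream files.
import os

with open("file_id_list_demo.scp") as f:
    file_ids = [line.strip() for line in f if line.strip()]

for file_id in file_ids[:3]:
    print(os.path.join("mgc", file_id + ".mgc"))
    print(os.path.join("lf0", file_id + ".lf0"))
    print(os.path.join("bap", file_id + ".bap"))
```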
Now that we've gone through the `experiments` dir and the format of its file contents, here's a short recap of where we are:
We've just run the `setup.sh` script. This script downloaded, unzipped, and formatted the data we need for training and testing. The data we need includes (1) various audio feature files, (2) label files, and (3) utterance lists.
global_settings.cfg
We shouldn't forget that the `setup.sh` script created a new, global configuration file in the `conf` dir.
This new configuration file is called `global_settings.cfg`.
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree conf/
conf/
├── dnn
│ ├── acoustic_slt_arctic_full.conf
│ ├── duration_slt_arctic_full.conf
│ ├── test_dur_synth_slt_arctic_full.conf
│ └── test_synth_slt_arctic_full.conf
├── **global_settings.cfg**
└── logging_config.conf
1 directory, 6 files
This global config file (`conf/global_settings.cfg`) will contain information on where the compiled Merlin programs are located, where the current working dir is, what kind of vocoder we're using, and how many files to use for training and testing.
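Because this file is consumed by the shell scripts, it appears to be a flat list of `KEY=value` assignments rather than an INI file. If you want the same values from Python, a tiny parser is enough. Note that the key name queried at the end is purely hypothetical, shown for illustration; check your own generated file for the real names:

```python
# Sketch: read flat KEY=value settings, skipping blanks and comments.
def load_settings(path):
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

settings = load_settings("conf/global_settings.cfg")
print(settings.get("Voice"))  # hypothetical key name, purely for illustration
```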
That's all for `setup.sh`!
Prepare Training Config Files
Now we're going to run just the `prepare_config_files.sh` script from within `run_demo.sh` (just like we did above with `setup.sh`), and take a look at what it does.
This script produces two configuration files for training:
- `duration_slt_arctic_demo.conf`
- `acoustic_slt_arctic_demo.conf`
When we run the script, we only get two lines of output to the terminal:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh
Duration configuration settings stored in conf/duration_slt_arctic_demo.conf
Acoustic configuration settings stored in conf/acoustic_slt_arctic_demo.conf
Sure enough, as promised by our two messages above, when we look into our pre-existing `conf` directory, we find two new configuration files:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree conf/
conf/
├── **acoustic_slt_arctic_demo.conf**
├── dnn
│ ├── acoustic_slt_arctic_full.conf
│ ├── duration_slt_arctic_full.conf
│ ├── test_dur_synth_slt_arctic_full.conf
│ └── test_synth_slt_arctic_full.conf
├── **duration_slt_arctic_demo.conf**
├── global_settings.cfg
└── logging_config.conf
1 directory, 8 files
These two new configuration files are of the same form as the original configuration files we found pre-existing when we cloned Merlin from GitHub.
They define architecture and training procedure for the acoustic model DNN and the duration model DNN.
Pretty straightforward overall.
Prepare Testing Config Files
This section deals with the `prepare_config_files_for_synthesis.sh` script called from `run_demo.sh`.
In the previous script we created the configuration files for our training procedure; now we do the same for our testing (a.k.a. synthesis) phase.
Just like before, we get two lines of output to the terminal.
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh
Duration configuration settings stored in conf/test_dur_synth_slt_arctic_demo.conf
Acoustic configuration settings stored in conf/test_synth_slt_arctic_demo.conf
Also, just like before, we have two new files in our `conf` dir:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree conf/
conf/
├── acoustic_slt_arctic_demo.conf
├── dnn
│ ├── acoustic_slt_arctic_full.conf
│ ├── duration_slt_arctic_full.conf
│ ├── test_dur_synth_slt_arctic_full.conf
│ └── test_synth_slt_arctic_full.conf
├── duration_slt_arctic_demo.conf
├── global_settings.cfg
├── logging_config.conf
├── **test_dur_synth_slt_arctic_demo.conf**
└── **test_synth_slt_arctic_demo.conf**
1 directory, 10 files
Train the Duration Model
If you’ve gotten here and followed all the previous steps, we’re ready to finally start training our DNNs… huzzah!
You'll see in the `run_demo.sh` script that after the data preparation phase is completed, we send a Theano job to our processing unit (CPU or GPU) and we specify that we want to train the duration model:
### Step 2: train duration model ###
echo "Step 2: training duration model..."
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/duration_${Voice}.conf
As you can see, what happens here is that the main Merlin run script `run_merlin.py` is called and given its argument, `duration_${Voice}.conf`, a configuration file for the duration DNN model.
This `run_merlin.py` script is then fed into another script, `submit.sh`, where it is processed as a job and sent to your computer's CPU(s) or GPU(s).
As such, the main script of interest in this training step is `run_merlin.py`.
This is a very clearly written Python script, and I invite you to take a look at the code yourself. Here I will merely show you the main train function and its arguments and defaults:
def train_DNN(train_xy_file_list, # training file list
valid_xy_file_list, # validation file list
nnets_file_name, # filename for DNN we save to disk
n_ins, # input feature dimensionality
n_outs, # output feature dimensionality
ms_outs, # multistream_outs
hyper_params, # hyperparameters for training and architecture
buffer_size, # training buffer size
plot=False, # create plot of (train/dev) training convergence
var_dict=None, # load covariance matrix
cmp_mean_vector = None, # cmp == audio features used in HTS training
cmp_std_vector = None, # cmp == audio features used in HTS training
init_dnn_model_file = None): # DNN model with which we initialize new DNN
Since this step in `run_demo.sh` produces a lot of output, I decided to record my terminal session and embed it here, for those interested in the details.
After the training is complete, we can see that we've produced some new dirs and files in our `experiments/duration_model` directory.
Here's our `duration_model` dir before training:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree -d experiments/
experiments/
└── slt_arctic_demo
└── duration_model
└── data
├── file_id_list_demo.scp
├── label_phone_align
│ └── arctic_a0001.lab
└── label_state_align
└── arctic_a0001.lab
And here's our `duration_model` dir after training:
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree -d experiments/
experiments/
└── slt_arctic_demo
└── duration_model
├── data
│ ├── binary_label_416
│ │ └── arctic_a0001.lab
│ ├── dur
│ │ └── arctic_a0001.dur
│ ├── file_id_list_demo.scp
│ ├── label_norm_HTS_416.dat
│ ├── label_phone_align
│ │ └── arctic_a0001.lab
│ ├── label_state_align
│ │ └── arctic_a0001.lab
│ ├── lf0
│ ├── nn_dur_5
│ │ └── arctic_a0001.cmp
│ ├── nn_norm_dur_5
│ │ └── arctic_a0001.cmp
│ ├── nn_no_silence_lab_416
│ │ └── arctic_a0001.lab
│ ├── nn_no_silence_lab_norm_416
│ │ └── arctic_a0001.lab
│ ├── norm_info_dur_5_MVN.dat
│ ├── ref_data
│ │ └── arctic_a0051.dur
│ └── var
│ └── dur_5
├── gen
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_1_50_416_5_4_512_512
│ ├── arctic_a0051.cmp
│ ├── arctic_a0051.dur
│ └── arctic_a0051.lab
├── log
│ ├── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_12_04PM_February_16_2017.log
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_12_04PM_February_16_2017.log.gitdiff
└── nnets_model
└── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_0_4_512_512_512_512_416.5.train.50.0.002000.rnn.model
27 directories
nnets_model
I’ll let you go through the different dirs and files, but I will point out the most important file, the trained DNN model.
This new, shiny DNN can be found in the `nnets_model` dir and, in my case, the filename is:
DNN_TANH_TANH_TANH_TANH_LINEAR__dur_0_4_512_512_512_512_416.5.train.50.0.002000.rnn.model
The filename is long, but it contains some very important information.
- activation functions used in our DNN layers: `TANH_TANH_TANH_TANH_LINEAR`
- model type, acoustic or duration: `dur`
- True or False, whether we have a multi-stream model: `0`
- number of hidden layers: `4`
- dimensionality of hidden layers: `512_512_512_512`
- number of input nodes (i.e. the dimensionality of labels (phonemes) in our model): `416`
- number of output nodes (i.e. the dimensionality of acoustic features to predict): `5`
- the number of utterances used in training: `50`
- the learning rate: `0.002000`
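Here's a short sketch that unpacks those fields programmatically, following my reading of the naming scheme above. This is not an official Merlin parser, so treat it as illustrative:

```python
# Sketch: pull the descriptive fields out of the trained-model filename.
name = ("DNN_TANH_TANH_TANH_TANH_LINEAR__dur_0_4_512_512_512_512_416"
        ".5.train.50.0.002000.rnn.model")

arch, rest = name.split("__")
activations = arch.split("_")[1:]        # ['TANH', 'TANH', 'TANH', 'TANH', 'LINEAR']

head, n_out, _, n_train, lr_int, lr_frac, _, _ = rest.split(".")
fields = head.split("_")                 # ['dur', '0', '4', '512', ..., '416']
model_type = fields[0]                   # 'dur'
multistream = fields[1]                  # '0'
n_hidden = int(fields[2])                # 4
layer_dims = fields[3:3 + n_hidden]      # ['512', '512', '512', '512']
n_in = fields[3 + n_hidden]              # '416'

print("%s: %s | layers %s | in=%s out=%s | %s utts | lr=%s.%s" % (
    model_type, "-".join(activations), "x".join(layer_dims),
    n_in, n_out, n_train, lr_int, lr_frac))
```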
Train the Acoustic Model
The process for training the acoustic model is exactly the same as that for training the duration model, the only difference being the configuration file which is used to initialize and train the DNN.
As you can see from the code snippet below from `run_demo.sh`, we send a `.conf` file to `run_merlin.py`, and then use `submit.sh` to send the job to our processor(s).
### Step 3: train acoustic model ###
echo "Step 3: training acoustic model..."
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/acoustic_${Voice}.conf
This process runs a lot longer than training the duration model. That’s because our output layer is much larger, and we have a lot more parameters in our DNN to train.
This took about an hour and a half on my laptop to train.
Now if we look into our `experiments` dir to see what new things we've added to the `acoustic_model` subdir, we see a similar picture to what we saw with the `duration_model` subdir.
First, let's look at the `acoustic_model` subdir before training the DNN:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree acoustic_model/
acoustic_model/
└── data
├── bap
│ └── arctic_a0001.bap
├── file_id_list_demo.scp
├── label_phone_align
│ └── arctic_a0001.lab
├── label_state_align
│ └── arctic_a0001.lab
├── lf0
│ └── arctic_a0001.lf0
└── mgc
└── arctic_a0001.mgc
And here’s what we get after we run the training script:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree acoustic_model/
acoustic_model/
├── data
│ ├── bap
│ │ └── arctic_a0001.bap
│ ├── binary_label_425
│ │ └── arctic_a0001.lab
│ ├── dur
│ ├── file_id_list_demo.scp
│ ├── label_norm_HTS_425.dat
│ ├── label_phone_align
│ │ └── arctic_a0001.lab
│ ├── label_state_align
│ │ └── arctic_a0001.lab
│ ├── lf0
│ │ └── arctic_a0001.lf0
│ ├── mgc
│ │ └── arctic_a0001.mgc
│ ├── nn_mgc_lf0_vuv_bap_187
│ │ └── arctic_a0001.cmp
│ ├── nn_norm_mgc_lf0_vuv_bap_187
│ │ └── arctic_a0001.cmp
│ ├── nn_no_silence_lab_425
│ │ └── arctic_a0001.lab
│ ├── nn_no_silence_lab_norm_425
│ │ └── arctic_a0001.lab
│ ├── norm_info_mgc_lf0_vuv_bap_187_MVN.dat
│ ├── ref_data
│ │ ├── arctic_a0051.bap
│ │ ├── arctic_a0051.lf0
│ │ └── arctic_a0051.mgc
│ └── var
│ ├── bap_3
│ ├── lf0_3
│ ├── mgc_180
│ └── vuv_1
├── gen
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_1_50_425_187_4_512_512
│ ├── arctic_a0051.bap
│ ├── arctic_a0051.cmp
│ ├── arctic_a0051.lf0
│ ├── arctic_a0051.mgc
│ ├── arctic_a0051.mgc_b0
│ ├── arctic_a0051.mgc_p_b0
│ ├── arctic_a0051.mgc_p_mgc
│ ├── arctic_a0051.mgc_p_r0
│ ├── arctic_a0051.mgc_r0
│ ├── arctic_a0051.wav
│ └── weight
├── log
│ ├── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_12_07PM_February_18_2017.log
│ └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_12_07PM_February_18_2017.log.gitdiff
└── nnets_model
└── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_0_4_512_512_512_512_425.187.train.50.0.002000.rnn.model
If you’ve made it to this point, you have completed training and have all you need to start synthesizing some speech!
Synthesize Speech
So let’s start synthesizing some speech now.
Just like in training, we use the main `run_merlin.py` script for synthesis:
### Step 4: synthesize speech ###
echo "Step 4: synthesizing speech..."
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/test_dur_synth_${Voice}.conf
Thankfully, this step takes almost no time at all, as you'll see in the terminal session below. For this demo, we are only synthesizing 10 audio files.
Let's take a look at what this script has created by comparing our `test_synthesis` dir before and after running this step:
Before running the synthesis script:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
├── prompt-lab
│ └── arctic_a0001.lab
└── test_id_list.scp
After running the duration synthesis script:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
├── gen-lab
│ ├── arctic_a0001.cmp
│ ├── arctic_a0001.dur
│ └── arctic_a0001.lab
├── prompt-lab
│ └── arctic_a0001.lab
└── test_id_list.scp
Now we run the acoustic synthesis script:
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/test_synth_${Voice}.conf
This synthesis step takes longer than the duration synthesis step, but it is still relatively quick even on my laptop.
Here's what we find in the `test_synthesis` dir after we've generated our wav files:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
├── gen-lab
│   ├── arctic_a0001.cmp
│   ├── arctic_a0001.dur
│   └── arctic_a0001.lab
├── prompt-lab
│   └── arctic_a0001.lab
├── test_id_list.scp
└── wav
├── arctic_a0001.bap
├── arctic_a0001.cmp
├── arctic_a0001.lf0
├── arctic_a0001.mgc
├── arctic_a0001.mgc_b0
├── arctic_a0001.mgc_p_b0
├── arctic_a0001.mgc_p_mgc
├── arctic_a0001.mgc_p_r0
├── arctic_a0001.mgc_r0
├── arctic_a0001.wav
└── weight
Clean up
At this point, we’ve trained our DNNs and used them to generate new speech in the form of WAV files. We could very well stop now and just take our WAV files and do with them whatever we want.
However, in the official Merlin demo, they add an extra step at the end that saves you some room on your disk.
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh
Step 5: deleting intermediate synthesis files...
synthesized audio files are in: experiments/slt_arctic_demo/test_synthesis/wav
All successfull!! Your demo voice is ready :)
Here's our nice, cleaned-up `test_synthesis` dir:
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
├── gen-lab
│ └── arctic_a0001.lab
├── prompt-lab
│ └── arctic_a0001.lab
├── test_id_list.scp
└── wav
└── arctic_a0001.wav
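As a last sanity check, you can read the generated wav back from Python and confirm its sample rate and duration; scipy is already one of Merlin's dependencies. A quick sketch:

```python
# Sketch: verify the synthesized audio by reading it back with scipy.
from scipy.io import wavfile

path = "experiments/slt_arctic_demo/test_synthesis/wav/arctic_a0001.wav"
rate, samples = wavfile.read(path)
print("%d Hz, %.2f seconds" % (rate, float(len(samples)) / rate))
```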
Conclusion
I hope you found this useful!
If you run into issues or have comments, leave a comment below.
Resources
Here's a tutorial on Statistical parametric speech synthesis written by Simon King, one of the creators of Merlin.