Getting started with the Merlin Speech Synthesis Toolkit

Introduction

This post is a short introduction to installing and using the Merlin Speech Synthesis toolkit.

In the following, I will display all the commands needed to (1) install Merlin from the official GitHub repository as well as (2) run the included demo.

In addition to showing and explaining the commands needed to install and run Merlin, I will also take some time to show how the scripts work and dive into the file structure expected by Merlin. However, I won’t get into any of the algorithms behind DNNs or speech synthesis.

More specifically, I will show which files are required by scripts, which files are generated by scripts, and how the main demo script run_demo.sh proceeds from data preparation to training to synthesis.

Here is a nice, concise description of the toolkit quoted directly from the official CSTR Merlin site:

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).

The system is written in Python and relies on the Theano numerical computation library.

Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the art systems.

Merlin is free software, distributed under an Apache License Version 2.0, allowing unrestricted commercial and non-commercial use alike.

Installation

Clone

Like good open-source software, the Merlin toolkit is hosted on GitHub and can be easily downloaded (cloned) with a single command:

josh@yoga:~/git$ git clone https://github.com/CSTR-Edinburgh/merlin.git
Cloning into 'merlin'...
remote: Counting objects: 1515, done.
remote: Total 1515 (delta 0), reused 0 (delta 0), pack-reused 1514
Receiving objects: 100% (1515/1515), 5.06 MiB | 682.00 KiB/s, done.
Resolving deltas: 100% (734/734), done.
Checking connectivity... done.

Easy as that! Now let’s take a peek into what we just downloaded.

josh@yoga:~/git$ cd merlin/
josh@yoga:~/git/merlin$ ls
COPYING  CREDITS.md  doc  egs  INSTALL  LICENSE  misc  README.md  src  test  tools

We can immediately see some familiar file and directory names, like doc for documentation, egs for examples, src for source code, etc.

First things first, we should always start investigating new code with the README file:

josh@yoga:~/git/merlin$ cat README.md 
The Neural Network (NN) based Speech Synthesis System
=====================================================
  
This repository contains the Neural Network (NN) based Speech Synthesis System  
developed at the Centre for Speech Technology Research (CSTR), University of 
Edinburgh. 

To build the toolkit: see `./INSTALL`.  These instructions are valid for UNIX
systems including various flavors of Linux;

To run the example system builds, see `egs/README.txt`

As a first demo, please follow the scripts in `egs/slt_arctic`

Synthetic speech samples
------------------------

Listen to [synthetic speech samples](https://cstr-edinburgh.github.io/merlin/demo.html) from our demo voice.

Development pattern for contributors
------------------------------------

1. [Create a personal fork](https://help.github.com/articles/fork-a-repo/)
   of the [main Merlin repository] (https://github.com/CSTR-Edinburgh/merlin) in GitHub.
2. Make your changes in a named branch different from `master`, e.g. you create
   a branch `my-new-feature`.
3. [Generate a pull request](https://help.github.com/articles/creating-a-pull-request/)
   through the Web interface of GitHub.

Contact Us
----------

Post your questions, suggestions, and discussions to [GitHub Issues](https://github.com/CSTR-Edinburgh/merlin/issues).

Citation
--------

If you publish work based on Merlin, please cite: 

Zhizheng Wu, Oliver Watts, Simon King, "Merlin: An Open Source Neural Network Speech Synthesis System" in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), September 2016, Sunnyvale, CA, USA.

Srikanth Ronanki, Zhizheng Wu, Oliver Watts, Simon King, "A Demonstration of the Merlin Open Source Neural Network Speech Synthesis System" in Proc. special demo session, 9th ISCA Speech Synthesis Workshop (SSW9), 2016, Sunnyvale, CA, USA.

This main README file gives us information on the toolkit, how to install it, how to work with it, and how to contact the team from the University of Edinburgh. For our current purposes, since we’re just trying to install and run the demo, let’s take a look into the INSTALL file.

josh@yoga:~/git/merlin$ cat ./INSTALL 
INSTALL
=======

(1) go to tools/  and follow INSTALL instructions there.

(2) Merlin is coded in python and need third-party python libraries such as:

numpy, scipy, matplotlib, lxml
    Usually shipped with your python packages
    Available in Ubuntu packages
theano
    Can be found on pip
    Need version 0.6 and above
    http://deeplearning.net/software/theano/
bandmat
    Can be found on pip
    https://pypi.python.org/pypi/bandmat

For running on NVIDIA GPU, you will need also CUDA
    https://developer.nvidia.com/cuda-zone
and you might want also CUDNN [optionnal]
    https://developer.nvidia.com/cudnn
    
Computationnal efficiency is obviously greatly improved using GPU.
It is also improved using the latest versions of theano.

So, we find that to install Merlin successfully, we have to:

Follow the instructions in the tools/INSTALL file
Install some Python dependencies

Compile

First we cd into the tools dir:

josh@yoga:~/git/merlin$ cd tools/
josh@yoga:~/git/merlin/tools$ ls
compile_tools.sh  INSTALL  WORLD  WORLD_v2

Then we take a look into the new INSTALL file we find here:

josh@yoga:~/git/merlin/tools$ cat ./INSTALL 
INSTALL
=======

./compile_tools.sh

Simple as that! We just have to run that one compile_tools.sh script and we should be good to go (apart from the Python dependencies).

So, when we run that script, we see something like this:

josh@yoga:~/git/merlin/tools$ ./compile_tools.sh 
downloading SPTK-3.9...
--2017-02-14 10:31:33--  http://downloads.sourceforge.net/sp-tk/SPTK-3.9.tar.gz
Resolving downloads.sourceforge.net (downloads.sourceforge.net)... 216.34.181.59
Connecting to downloads.sourceforge.net (downloads.sourceforge.net)|216.34.181.59|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz [following]
--2017-02-14 10:31:34--  http://downloads.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz
Reusing existing connection to downloads.sourceforge.net:80.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://netix.dl.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz [following]
--2017-02-14 10:31:34--  https://netix.dl.sourceforge.net/project/sp-tk/SPTK/SPTK-3.9/SPTK-3.9.tar.gz
Resolving netix.dl.sourceforge.net (netix.dl.sourceforge.net)... 87.121.121.2
Connecting to netix.dl.sourceforge.net (netix.dl.sourceforge.net)|87.121.121.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1077702 (1.0M) [application/x-gzip]
Saving to: ‘SPTK-3.9.tar.gz.1’

100%[===========================================================================================================================================================================>] 1,077,702    262KB/s   in 4.0s   

2017-02-14 10:31:39 (262 KB/s) - ‘SPTK-3.9.tar.gz.1’ saved [1077702/1077702]

compiling SPTK...
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking for style of include used by make... GNU
                                .
                                .
                                .
test/audioio.cpp:92:38: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
       fread(&data_check[1], 1, 3, fp);
                                      ^
test/audioio.cpp:104:34: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
   fread(for_int_number, 1, 4, fp);  // "data"
                                  ^
mkdir -p ./build/objs/test
g++ -O1 -g -Wall -fPIC -Isrc -o "build/objs/test/analysis.o" -c "test/analysis.cpp"
g++ -O1 -g -Wall -fPIC -o ./build/analysis ./build/objs/test/audioio.o ./build/objs/test/analysis.o ./build/libworld.a -lm
mkdir -p ./build/objs/test
g++ -O1 -g -Wall -fPIC -Isrc -o "build/objs/test/synth.o" -c "test/synth.cpp"
test/synth.cpp: In function ‘void {anonymous}::WaveformSynthesis(WorldParameters*, int, int, double*)’:
test/synth.cpp:240:9: warning: variable ‘elapsed_time’ set but not used [-Wunused-but-set-variable]
   DWORD elapsed_time;
         ^
test/synth.cpp: In function ‘int main(int, char**)’:
test/synth.cpp:313:55: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
  fread(&world_parameters.f0[i], sizeof(double), 1, fp);
                                                       ^
test/synth.cpp:332:72: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
       fread(&world_parameters.spectrogram[i][j], sizeof(double), 1, fp);
                                                                        ^
test/synth.cpp:341:65: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
       fread(&coarse_aperiodicities[i][j], sizeof(double), 1, fp);
                                                                 ^
g++ -O1 -g -Wall -fPIC -o ./build/synth ./build/objs/test/audioio.o ./build/objs/test/synth.o ./build/libworld.a -lm
Removing all temporary binaries... 
Done.
All tools successfully compiled!!

Since the output to the terminal was pretty long, I cut out a big part in the middle as you can see.

Install Python Dependencies

So now that we have compiled our Merlin tools, let’s move on to the second step in the main merlin/INSTALL file where we are told to make sure we have the right Python dependencies installed. It’s not explicit, but I’m pretty sure we need Python2 and not Python3. In any case, I’ve tested this out with Python2 on my system and it seems to work ok.

So, let’s get all the dependencies in one fell swoop with the trusty pip program. This program helps us install Python packages securely and easily. You probably already have it installed, but if not, follow the installation instructions in the official pip documentation.

We can install all the dependencies with one command:

josh@yoga:~/git/merlin$ pip install numpy scipy matplotlib lxml theano bandmat

Since I had all these packages installed already, I get the following messages to the terminal:

josh@yoga:~$ pip install numpy scipy matplotlib lxml theano bandmat
Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): scipy in /usr/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /usr/lib/pymodules/python2.7
Requirement already satisfied (use --upgrade to upgrade): lxml in /usr/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): theano in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): bandmat in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): tornado in /usr/local/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): pyparsing>=1.5.6 in /usr/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): nose in /usr/lib/python2.7/dist-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): six>=1.9.0 in /usr/local/lib/python2.7/dist-packages (from theano)
Requirement already satisfied (use --upgrade to upgrade): backports.ssl-match-hostname in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Requirement already satisfied (use --upgrade to upgrade): singledispatch in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Requirement already satisfied (use --upgrade to upgrade): certifi in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Requirement already satisfied (use --upgrade to upgrade): backports-abc>=0.4 in /usr/local/lib/python2.7/dist-packages (from tornado->matplotlib)
Cleaning up...

So now, at this point we have downloaded and compiled Merlin along with all its necessary dependencies!
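If you want to double-check the Python side before moving on, here is a minimal sketch that tries to import each package named in the merlin/INSTALL file. The function name check_modules is my own and not part of Merlin, and the check only tells you whether a module imports, not whether its version is recent enough.

```python
import importlib

# Module names taken from the merlin/INSTALL file quoted above.
REQUIRED_MODULES = ["numpy", "scipy", "matplotlib", "lxml", "theano", "bandmat"]


def check_modules(names):
    """Return a dict mapping each module name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # ImportError for a missing package, or any other error raised
            # by a broken installation while it is being imported.
            status[name] = False
    return status
```

Calling check_modules(REQUIRED_MODULES) with the same Python interpreter you plan to run Merlin with should give True for all six names.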

Now let’s go on to run the demo.

Run the Demo

As per the main merlin/README.md file, we are invited to begin with the slt_arctic example.

So, let’s cd into that directory and take a look around.

josh@yoga:~/git/merlin$ cd egs/slt_arctic/s1/
josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ls
conf  merlin_synthesis.sh  README.md  RESULTS.md  run_demo.sh  run_full_voice.sh  scripts  testrefs

First things first, let’s look at what the README has to tell us about this example.

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ cat README.md 
Demo voice
----------

To run demo voice, please follow below steps:

Step 1: git clone https://github.com/CSTR-Edinburgh/merlin.git <br/>
Step 2: cd merlin/egs/slt_arctic/s1 <br/>
Step 3: ./run_demo.sh

Demo voice trains only on 50 utterances and shouldnt take more than 5 min. 

Compare the results in log files to baseline results from demo data in [RESULTS.md](https://github.com/CSTR-Edinburgh/merlin/blob/master/egs/slt_arctic/s1/RESULTS.md)

Full voice
----------

To run full voice, please follow below steps:

Step 1: git clone https://github.com/CSTR-Edinburgh/merlin.git <br/>
Step 2: cd merlin/egs/slt_arctic/s1 <br/>
Step 3: ./run_full_voice.sh

Full voice utilizes the whole arctic data (1132 utterances). The training of the voice approximately takes 1 to 2 hours. 

Compare the results in log files to baseline results from full data in [RESULTS.md](https://github.com/CSTR-Edinburgh/merlin/blob/master/egs/slt_arctic/s1/RESULTS.md)

Generate new sentences
----------------------

To generate new sentences, please follow below steps:

Step 1: Run either demo voice or full voice. <br/>
Step 2: ./merlin_synthesis.sh

So, we’re going to start with the simpler demo voice as suggested.

Pre-existing Dirs & Files

Before we just go and run run_demo.sh, let’s investigate what data and scripts we have so far, so that we can get an idea of what Merlin requires and what the workflow is.

I always like to use the tree program for Linux for looking at file structures. It’s a very simple program that will list the contents of a directory recursively.

When we look at the file structure for the slt_arctic/s1 dir, before we’ve run any scripts, we see this:

josh@yoga:~/git/merlin/egs/slt_arctic$ tree s1/
s1/
├── conf
│   ├── dnn
│   │   ├── acoustic_slt_arctic_full.conf
│   │   ├── duration_slt_arctic_full.conf
│   │   ├── test_dur_synth_slt_arctic_full.conf
│   │   └── test_synth_slt_arctic_full.conf
│   └── logging_config.conf
├── merlin_synthesis.sh
├── README.md
├── RESULTS.md
├── run_demo.sh
├── run_full_voice.sh
├── scripts
│   ├── prepare_config_files_for_synthesis.sh
│   ├── prepare_config_files.sh
│   ├── prepare_labels_from_txt.sh
│   ├── remove_intermediate_files.sh
│   ├── setup.sh
│   └── submit.sh
└── testrefs
    ├── slt_arctic_demo
    │   ├── acoustic_model
    │   │   └── log
    │   │       └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_04_59PM_August_23_2016.log
    │   └── duration_model
    │       └── log
    │           └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_04_57PM_August_23_2016.log
    └── slt_arctic_full
        ├── acoustic_model
        │   └── log
        │       └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_1000_259_4_512_0.002000_08_47PM_August_30_2016.log
        └── duration_model
            └── log
                └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_1000_259_4_512_0.002000_08_44PM_August_30_2016.log

14 directories, 20 files

So, what should we take away from the files and directories present?

`conf`

The first dir we find in slt_arctic/s1 is conf (short for configuration).

The conf dir contains one configuration file for logging (logging_config.conf), as well as one dir (dnn) which contains DNN configuration files.

In the conf/dnn dir, the DNN “conf” files define some information about the paths to relevant directories, information about the training data, and the architecture of the DNNs we want to train.

There are a total of four (4) DNN configuration files, the first two for training and the last two for testing:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/conf$ tree dnn/
dnn/
├── acoustic_slt_arctic_full.conf
├── duration_slt_arctic_full.conf
├── test_dur_synth_slt_arctic_full.conf
└── test_synth_slt_arctic_full.conf

0 directories, 4 files

In Merlin, we don’t just model the acoustics of the phonemes of the language; we also model their durations. Both (1) acoustic modeling and (2) duration modeling use DNNs, so the conf/dnn dir above holds one training configuration file for each model, plus one matching test-time file for each.

The team’s demonstration paper concisely describes the duration model as follows:

Duration modelling: Merlin models duration using a separate DNN to the acoustic model. The duration model is trained on the aligned data, to predict phone- and/or state-level durations. At synthesis time, duration is predicted first, and is used as an input to the acoustic model to predict the speech parameters.

As such, we need two DNNs, and that means we need two configuration files.
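To make the "duration first, acoustics second" ordering concrete, here is a toy sketch. It is not Merlin’s code: the function expand_to_frames and the numbers are made up, and a real system feeds full linguistic feature vectors rather than bare phone names. The only point is that predicted durations decide how many frames of input the acoustic model sees for each phone.

```python
def expand_to_frames(phones, durations_in_frames):
    """Repeat each phone label once per predicted frame."""
    if len(phones) != len(durations_in_frames):
        raise ValueError("need exactly one duration per phone")
    frames = []
    for phone, n_frames in zip(phones, durations_in_frames):
        frames.extend([phone] * n_frames)
    return frames


# Hypothetical output of a duration model for three phones.
phones = ["sil", "ao", "th"]
predicted_durations = [3, 2, 4]

# These per-frame inputs are what a second (acoustic) model would consume.
frame_inputs = expand_to_frames(phones, predicted_durations)
print(frame_inputs)
```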

`scripts`

Moving onto the second directory within s1, we find the location of our data preparation scripts.

Logically, this directory is labeled scripts.

Let’s take a look inside:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree scripts/
scripts/
├── prepare_config_files_for_synthesis.sh
├── prepare_config_files.sh
├── prepare_labels_from_txt.sh
├── remove_intermediate_files.sh
├── setup.sh
└── submit.sh

0 directories, 6 files

The first four data preparation scripts in the scripts dir have very transparent filenames, so I won’t elaborate on them.

However, we should note that the main purpose of the setup.sh script is to:

download the demo data
create the main directories to house the current experiment
move test and train data into those directories
define the global configuration file

Moving on, the last script located in the s1/scripts/ dir is submit.sh. The name is transparent once you know that this script will take any Theano job and submit it to either a GPU or CPU, depending on what you have available.

`testrefs`

Moving back up a level to the s1/ dir, the last of the three main dirs is testrefs. This dir contains only four (4) files, which are all log files from training performed by the CSTR team. These files can be used to compare against our own training in case we hit any problems.

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree testrefs/
testrefs/
├── slt_arctic_demo
│   ├── acoustic_model
│   │   └── log
│   │       └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_04_59PM_August_23_2016.log
│   └── duration_model
│       └── log
│           └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_04_57PM_August_23_2016.log
└── slt_arctic_full
    ├── acoustic_model
    │   └── log
    │       └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_1000_259_4_512_0.002000_08_47PM_August_30_2016.log
    └── duration_model
        └── log
            └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_1000_259_4_512_0.002000_08_44PM_August_30_2016.log

6 directories, 4 files

At this point, we’ve gone over the content in each of the three dirs in the first level of the main s1 dir.

Specifically, we mentioned:

s1/conf: contains configuration files for building, training, and testing our DNNs
s1/scripts: contains scripts for preparing data and submitting Theano jobs
s1/testrefs: contains log files from the CSTR team for our reference

Run the `run_demo.sh` Script

Now that we’ve gone over our pre-existing dirs, let’s go on to the run script and start running the demo.

The s1/README.md file directs us to run the run_demo.sh script first, so I will start there.

So, walking through the run_demo.sh script, the first thing we see is three scripts in the data prep stage:

### Step 1: setup directories and the training data files ###
echo "Step 1: setting up experiments directory and the training data files..."
global_config_file=conf/global_settings.cfg
./scripts/setup.sh slt_arctic_demo
./scripts/prepare_config_files.sh $global_config_file
./scripts/prepare_config_files_for_synthesis.sh $global_config_file

If we take the time to run these scripts one-by-one, we can more easily see what’s going on. I’m going to run each script here, and look into what’s being output by each one.
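If you prefer to drive the three Step 1 scripts from Python rather than typing them one by one, something like the following sketch would do. The script paths and arguments are copied from run_demo.sh above; the helper names step1_commands and run_step1 are my own, and the code assumes it is started from inside the s1 dir.

```python
import subprocess

GLOBAL_CONFIG = "conf/global_settings.cfg"


def step1_commands(voice_name, global_config=GLOBAL_CONFIG):
    """Return the three Step 1 commands in the order run_demo.sh runs them."""
    return [
        ["./scripts/setup.sh", voice_name],
        ["./scripts/prepare_config_files.sh", global_config],
        ["./scripts/prepare_config_files_for_synthesis.sh", global_config],
    ]


def run_step1(voice_name):
    """Run each command in turn; check_call raises if a script exits non-zero."""
    for command in step1_commands(voice_name):
        print("running: " + " ".join(command))
        subprocess.check_call(command)
```

Calling run_step1("slt_arctic_demo") stops at the first failing script, which makes it easy to inspect what each one left behind before the next one runs.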

Prepare Data & Create Config File

setup.sh does two important things:

download, unzip, and move training and testing data into new experiments dir
create global configuration file, global_settings.cfg, in our conf dir

So, beginning by running only the setup.sh script (from within run_demo.sh), we get the following output to the terminal:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh
Step 1: setting up experiments directory and the training data files...
downloading data.....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.9M  100 12.9M    0     0  2856k      0  0:00:04  0:00:04 --:--:-- 3181k
unzipping files......
data is ready!
Merlin default voice settings configured in conf/global_settings.cfg
setup done...!

After running setup.sh, if we look into the file structure of the s1 dir (for just the first level, hence the -L 1 flag), we can see we’ve created a few new things.

I’ve put asterisks (**) around the newly created dirs and file:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree -L 1 .
.
├── conf
├── **experiments**
├── merlin_synthesis.sh
├── README.md
├── RESULTS.md
├── run_demo.sh
├── run_full_voice.sh
├── scripts
├── **slt_arctic_demo_data**
├── **slt_arctic_demo_data.zip**
└── testrefs

5 directories, 6 files

Specifically, we have created the following:

experiments dir
slt_arctic_demo_data dir
slt_arctic_demo_data.zip compressed file

What happened was the following:

the relevant data files were downloaded from Srikanth Ronanki’s homepage, hence the slt_arctic_demo_data.zip file
this compressed file was uncompressed and saved as slt_arctic_demo_data
the relevant files were copied from slt_arctic_demo_data into the new experiments directory

As such, at this point the experiments dir contains mostly just extracted audio feature files.

Here’s what we find in this new experiments dir:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree experiments/
experiments/
└── slt_arctic_demo
    ├── acoustic_model
    │   └── data
    │       ├── bap
    │       │   └── arctic_a0001.bap
    │       ├── file_id_list_demo.scp
    │       ├── label_phone_align
    │       │   └── arctic_a0001.lab
    │       ├── label_state_align
    │       │   └── arctic_a0001.lab
    │       ├── lf0
    │       │   └── arctic_a0001.lf0
    │       └── mgc
    │           └── arctic_a0001.mgc
    ├── duration_model
    │   └── data
    │       ├── file_id_list_demo.scp
    │       ├── label_phone_align
    │       │   └── arctic_a0001.lab
    │       └── label_state_align
    │           └── arctic_a0001.lab
    └── test_synthesis
        ├── prompt-lab
        │   └── arctic_a0001.lab
        └── test_id_list.scp

14 directories, 433 files

In the above output I’ve omitted all feature and label files except those for arctic_a0001, because there are a lot of them.

Specifically, there are 430 feature and label files in total (the 433 files reported by tree, minus the three *.scp lists).
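If you don’t have tree installed, or you want the file counts broken down by type, a few lines of Python will do. This is just a sketch; count_by_extension is my own helper name, not something shipped with Merlin.

```python
import collections
import os


def count_by_extension(root):
    """Walk root recursively and count files per filename extension."""
    counts = collections.Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            extension = os.path.splitext(filename)[1]
            counts[extension] += 1
    return counts
```

Running count_by_extension("experiments") from the s1 dir returns a Counter keyed by extension (".bap", ".lab", and so on), and sum(counts.values()) should agree with the file total that tree reports.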

Since there’s a lot of stuff going on in the experiments dir, I think it’s worth the time to briefly explain what we have downloaded in terms of audio feature, label, and utterance ID files.

File Formats in `experiments` Dir

In terms of the file formats, we find the following:

*.bap: band aperiodicities
*.lab: label files (time-to-phone or time-to-state alignments)
*.lf0: log-fundamental frequencies
*.mgc: mel-generalized cepstral coefficients
*.scp: script files listing file IDs

`*.bap`

The first file type, *.bap, is a kind of feature extracted from the audio, and we have one file for every audio file in our data set. We look into the *.bap file itself with the help of the x2x program which is included in Merlin via SPTK (Speech Signal Processing Toolkit). In the following command we use +fa to convert our binary floats (“f”) to ASCII (“a”).

The only reason I’m using head and tail here is to get a section of the file which has numbers other than zeros, so it’s more interesting.

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/bap$ ../../../../../../../../tools/bin/SPTK-3.9/x2x +fa arctic_a0001.bap | head -50 | tail -10
0
0
0
0
-1.74326
-1.80508
-7.27617
-6.94787
-3.88911
-5.94744
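The same peek is possible without x2x. The sketch below assumes what the +fa flag implies, namely that the file is a headerless stream of 4-byte floats in the machine’s native byte order. The function name read_binary_floats is my own.

```python
import array
import os


def read_binary_floats(path):
    """Read a headerless file of 4-byte native-endian floats into a list."""
    values = array.array("f")
    # Any trailing bytes that do not fill a whole float are ignored.
    n_values = os.path.getsize(path) // values.itemsize
    with open(path, "rb") as handle:
        values.fromfile(handle, n_values)
    return values.tolist()


# Example, run from the bap directory used above:
#   values = read_binary_floats("arctic_a0001.bap")
#   print(values[40:50])   # the same slice as `head -50 | tail -10`
```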

`*.lab`

The second file type in the experiments dir is *.lab. These are the “label” files, which align either phones or states to the audio files in our data set. We have two kinds of label files here: (1) phoneme alignments, and (2) state alignments.

Quoting from email communication with Srikanth Ronanki (one of the Merlin developers):

To train a voice using Merlin, you either require phone-level or state-level alignments but not both. This option is configurable in global_settings.cfg.

Only very few international languages have enough resources to derive state-level alignments using HMMs – therefore Merlin is configured to even run with phone-level alignments which can be manually-labelled. But, the default demo runs using state-level alignments.

First, for phoneme alignments, we see something like this:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/label_phone_align$ head arctic_a0001.lab 
0 2050000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2
2050000 3400000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
3400000 4650000 sil^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
4650000 5950000 ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$1-3!1-1;1-3|er/C:1+0+2/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
5950000 6650000 th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$1-3!2-2;2-2|ah/C:0+0+2/D:content_2/E:in+1@2+4&2+2#1+2/F:det_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
6650000 7650000 er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$1-3!2-2;2-2|ah/C:0+0+2/D:content_2/E:in+1@2+4&2+2#1+2/F:det_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
7650000 8200000 ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$1-3!1-1;3-1|ax/C:1+1+4/D:in_1/E:det+1@3+3&2+2#2+1/F:content_2/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
8200000 8500000 v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$1-3!1-1;3-1|ax/C:1+1+4/D:in_1/E:det+1@3+3&2+2#2+1/F:content_2/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
8500000 9450000 dh^ax-d+ey=n@1_4/A:0_0_2/B:1-1-4@1-2&5-3#2-2$1-2!2-2;4-2|ey/C:0+0+1/D:det_1/E:content+2@4+2&2+1#3+1/F:content_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
9450000 10450000 ax^d-ey+n=jh@2_3/A:0_0_2/B:1-1-4@1-2&5-3#2-2$1-2!2-2;4-2|ey/C:0+0+1/D:det_1/E:content+2@4+2&2+1#3+1/F:content_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2
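Each line above has three fields: a start time, an end time, and a long context label that begins with five phones in the pattern ll^l-c+r=rr, where c is the current phone. Here is a minimal parsing sketch. It assumes the times follow the usual HTK convention of 100-nanosecond units, which the post itself doesn’t state, and the names parse_label_line and QUINPHONE are my own.

```python
import re

# Assumption: HTK-style label times are integers in units of 100 nanoseconds.
HTK_UNITS_PER_SECOND = 1e7

# The label starts with the quinphone context: ll^l-c+r=rr@...
QUINPHONE = re.compile(
    r"^(?P<ll>[^\^]+)\^(?P<l>[^-]+)-(?P<c>[^+]+)\+(?P<r>[^=]+)=(?P<rr>[^@]+)@"
)


def parse_label_line(line):
    """Split one .lab line into start/end seconds and the five-phone context."""
    start, end, label = line.split(None, 2)
    match = QUINPHONE.match(label)
    if match is None:
        raise ValueError("unexpected label format: %r" % label)
    return {
        "start_s": int(start) / HTK_UNITS_PER_SECOND,
        "end_s": int(end) / HTK_UNITS_PER_SECOND,
        "phones": match.group("ll", "l", "c", "r", "rr"),
    }


# Second line of the phone-level output shown above.
line = ("2050000 3400000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3"
        "!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0"
        "/H:7=5@1=2|L-L%/I:7=3/J:14+8-2")
entry = parse_label_line(line)
print(entry["phones"][2], entry["start_s"], entry["end_s"])
```

Under that assumption, the phone ao in this line would span roughly 0.205 s to 0.34 s of the recording.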

For state-level alignments, we get the following:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/label_state_align$ head arctic_a0001.lab 
0 50000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[2]
50000 100000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[3]
100000 150000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[4]
150000 1700000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[5]
1700000 2050000 x^x-sil+sil=ao@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=2|0/I:0=0/J:14+8-2[6]
2050000 2400000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[2]
2400000 2550000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[3]
2550000 2650000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[4]
2650000 2700000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[5]
2700000 3400000 sil^sil-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$1-3!0-2;0-4|ao/C:0+0+1/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5@1=2|L-L%/I:7=3/J:14+8-2[6]

For a little sanity check, we can see that the two *.lab files for the same audio file have a different number of lines.

Each line in a *.lab file represents a single alignment in time. As such, we would expect that for a given audio file, its state alignment .lab file will have more lines than its phoneme .lab file. This is because one phoneme is made up of multiple states in HMM-based ASR (i.e., the machinery used to generate alignments).

We’d have to look more into the details of the acoustic model used for forced alignment to be sure, but the line counts of the two alignment files are about what we’d expect:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data$ wc -l label_phone_align/arctic_a0001.lab 
37 label_phone_align/arctic_a0001.lab
josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data$ wc -l label_state_align/arctic_a0001.lab 
185 label_state_align/arctic_a0001.lab

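In this case, for the utterance arctic_a0001 we have a total of 37 phone segments and 185 states, which comes out to 185/37 = 5 states per phone segment. This matches the state indices [2] through [6] at the end of each line in the state-level file.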
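Rather than trusting the division, we can count the states per phone directly by grouping consecutive state-level lines that share a label once the trailing [n] index is stripped. This is only a sketch: states_per_phone is my own helper, and the sample lines are the first ten lines of the state-level output above with everything after the @ field dropped for brevity.

```python
import itertools
import re

# The state index is the bracketed number at the very end of each label.
STATE_SUFFIX = re.compile(r"\[(\d+)\]$")


def states_per_phone(lines):
    """Count runs of consecutive lines whose labels match without the [n]."""
    labels = [STATE_SUFFIX.sub("", line.split()[2])
              for line in lines if line.strip()]
    return [len(list(group)) for _label, group in itertools.groupby(labels)]


# Abbreviated labels: the real files carry the full context string.
sample = [
    "0 50000 x^x-sil+sil=ao@x_x[2]",
    "50000 100000 x^x-sil+sil=ao@x_x[3]",
    "100000 150000 x^x-sil+sil=ao@x_x[4]",
    "150000 1700000 x^x-sil+sil=ao@x_x[5]",
    "1700000 2050000 x^x-sil+sil=ao@x_x[6]",
    "2050000 2400000 sil^sil-ao+th=er@1_2[2]",
    "2400000 2550000 sil^sil-ao+th=er@1_2[3]",
    "2550000 2650000 sil^sil-ao+th=er@1_2[4]",
    "2650000 2700000 sil^sil-ao+th=er@1_2[5]",
    "2700000 3400000 sil^sil-ao+th=er@1_2[6]",
]
print(states_per_phone(sample))
```

For a full file, passing open("label_state_align/arctic_a0001.lab") as lines gives one count per phone segment, so the length of the returned list can be compared against the line count of the phone-level file.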

`*.lf0`

The *.lf0 files are the log-fundamental frequency files, that is, another kind of feature file extracted from the audio files in our data set.

Again here, we’re using the x2x program to get something human-readable from the feature file:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/lf0$ ../../../../../../../../tools/bin/SPTK-3.9/x2x +fa arctic_a0001.lf0 | head -50 | tail -10
-1e+10
-1e+10
-1e+10
-1e+10
5.63102
5.70239
5.6373
5.46527
5.46528
5.45657

I’ve deleted a bunch of empty lines in the above output so as not to take up so much space.
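
If you’d rather inspect these values from Python than through x2x, something like the sketch below should work. It rests on two assumptions: the file is a flat sequence of little-endian 32-bit floats (which is what x2x +fa is converting from on a typical x86 machine), and -1e+10 is a placeholder for unvoiced frames rather than a real log-F0 value. Both function names are my own.

```python
import math
import struct

# Assumption: values at or below this threshold are the -1e+10 "unvoiced" marker.
UNVOICED_THRESHOLD = -1e9


def read_float32(path):
    """Read a headerless binary file of little-endian 32-bit floats."""
    with open(path, "rb") as handle:
        raw = handle.read()
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[: count * 4]))


def lf0_to_hz(values):
    """Convert log-F0 values to Hz, mapping unvoiced frames to 0.0."""
    return [0.0 if v <= UNVOICED_THRESHOLD else math.exp(v) for v in values]


# Two of the values from the x2x output above.
print(lf0_to_hz([-1e10, 5.63102]))
```

The voiced value 5.63102 comes out at roughly 279 Hz, which is a plausible pitch for a female speaker.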

`*.mgc`

Next up is another feature file type: *.mgc. These files contain the mel-generalized cepstral coefficients for the audio files in our data set. Again, this is not very human readable:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data/mgc$ ../../../../../../../../tools/bin/SPTK-3.9/x2x +fa arctic_a0001.mgc | head -50 | tail -10
-0.0696582
-0.0268611
0.0244125
0.038228
-0.0542025
0.0157833
0.0227506
-0.0430836
0.0407014
-0.0229975

`*.scp`

Moving on to the next file type, the *.scp files here, I’m guessing, are similar to Kaldi’s *.scp files. These files are “script” files which typically contain lists of information. In this case, the experiments dir contains two *.scp files which contain lists of file ids. For example:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo/acoustic_model/data$ head file_id_list_demo.scp 
arctic_a0001
arctic_a0002
arctic_a0003
arctic_a0004
arctic_a0005
arctic_a0006
arctic_a0007
arctic_a0008
arctic_a0009
arctic_a0010
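
As a rough illustration of how such a list of file ids can be used, the sketch below reads the ids and builds the matching feature file paths. The helper names are hypothetical; the directory layout (one subdirectory per feature type, with the file extension matching the directory name) is taken from the acoustic_model/data tree in this demo.

```python
from pathlib import Path


def read_file_ids(lines):
    """Return the non-empty, stripped file ids from an .scp-style listing."""
    return [line.strip() for line in lines if line.strip()]


def feature_paths(data_dir, file_id, feature_types=("bap", "lf0", "mgc")):
    """Map each feature type to <data_dir>/<type>/<file_id>.<type>."""
    data_dir = Path(data_dir)
    return {ext: data_dir / ext / f"{file_id}.{ext}" for ext in feature_types}


ids = read_file_ids(["arctic_a0001\n", "arctic_a0002\n", "\n"])
for file_id in ids:
    print(file_id, feature_paths("acoustic_model/data", file_id)["mgc"])
```

With the real file you would pass open("file_id_list_demo.scp") as the lines argument.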

Now that we’ve gone through the experiments dir and the format of its file contents, here’s a short recap of where we are:

We’ve just run the setup.sh script. This script downloaded, unzipped, and formatted the data we need for training and testing. The data we need includes (1) various audio feature files, (2) label files, and (3) utterance lists.

`global_settings.cfg`

We shouldn’t forget that the setup.sh script created a new, global configuration file in the conf dir.

This new configuration file is called global_settings.cfg.

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree conf/
conf/
├── dnn
│   ├── acoustic_slt_arctic_full.conf
│   ├── duration_slt_arctic_full.conf
│   ├── test_dur_synth_slt_arctic_full.conf
│   └── test_synth_slt_arctic_full.conf
├── **global_settings.cfg**
└── logging_config.conf

1 directory, 6 files

This global config file (conf/global_settings.cfg) contains information on where the compiled Merlin programs are located, where the current working dir is, what kind of vocoder we’re using, and how many files to use for training and testing.
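
If you want to read these settings from Python, a small parser along these lines should do. I’m assuming the file consists of shell-style KEY=value lines (run_demo.sh refers to ${MerlinDir} and ${Voice}, which suggests it is sourced by the shell). The keys and values in the sample text are illustrative only, not copied from a real global_settings.cfg.

```python
def parse_settings(text):
    """Parse shell-style KEY=value lines into a dict, skipping comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Illustrative content only: these keys and values are assumptions.
sample = """
# hypothetical global settings
Voice=slt_arctic_demo
Vocoder=WORLD
Train=50
"""
settings = parse_settings(sample)
print(settings["Voice"], settings["Vocoder"], int(settings["Train"]))
```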

That’s all for setup.sh!

Prepare Training Config Files

Now we’re going to run just the prepare_config_files.sh script from within run_demo.sh (just like we did above with setup.sh) and take a look at what it does.

This script produces two configuration files for training:

duration_slt_arctic_demo.conf
acoustic_slt_arctic_demo.conf

When we run the script, we only get two lines of output to the terminal:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh 
Duration configuration settings stored in conf/duration_slt_arctic_demo.conf
Acoustic configuration settings stored in conf/acoustic_slt_arctic_demo.conf

Sure enough, as promised by the two messages above, when we look into our pre-existing conf directory, we find two new configuration files:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree conf/
conf/
├── **acoustic_slt_arctic_demo.conf**
├── dnn
│   ├── acoustic_slt_arctic_full.conf
│   ├── duration_slt_arctic_full.conf
│   ├── test_dur_synth_slt_arctic_full.conf
│   └── test_synth_slt_arctic_full.conf
├── **duration_slt_arctic_demo.conf**
├── global_settings.cfg
└── logging_config.conf

1 directory, 8 files

These two new configuration files are of the same form as the original configuration files we found pre-existing when we cloned Merlin from GitHub.

They define architecture and training procedure for the acoustic model DNN and the duration model DNN.

Pretty straightforward overall.

Prepare Testing Config Files

This section deals with the prepare_config_files_for_synthesis.sh script called from run_demo.sh.

In the previous script we created the configuration files for our training procedure. Now we do the same for our testing (aka synthesis) phase.

Just like before, we get two lines of output to the terminal.

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh 
Duration configuration settings stored in conf/test_dur_synth_slt_arctic_demo.conf
Acoustic configuration settings stored in conf/test_synth_slt_arctic_demo.conf

Also, just like before, we have two new files in our conf dir:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree conf/
conf/
├── acoustic_slt_arctic_demo.conf
├── dnn
│   ├── acoustic_slt_arctic_full.conf
│   ├── duration_slt_arctic_full.conf
│   ├── test_dur_synth_slt_arctic_full.conf
│   └── test_synth_slt_arctic_full.conf
├── duration_slt_arctic_demo.conf
├── global_settings.cfg
├── logging_config.conf
├── **test_dur_synth_slt_arctic_demo.conf**
└── **test_synth_slt_arctic_demo.conf**

1 directory, 10 files

Train the Duration Model

If you’ve gotten here and followed all the previous steps, we’re ready to finally start training our DNNs… huzzah!

You’ll see in the run_demo.sh script that after the data preparation phase is completed, we send a Theano job to our processing unit (CPU or GPU) and we specify that we want to train the duration model:

### Step 2: train duration model ###
echo "Step 2: training duration model..."
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/duration_${Voice}.conf

As you can see, what happens here is that the main Merlin run script run_merlin.py is called and given its argument, duration_${Voice}.conf, a configuration file for the duration DNN model.

This run_merlin.py script is then fed into another script, submit.sh, where it is processed as a job and sent to your computer’s CPU(s) or GPU(s).

As such, the main script of interest in this training step is run_merlin.py.

This is a very clearly written Python script, and I invite you to look through the code yourself. Here I will merely show you the main train function with its arguments and defaults:

def train_DNN(train_xy_file_list,               # training file list
              valid_xy_file_list,               # validation file list
              nnets_file_name,                  # filename for DNN we save to disk
              n_ins,                            # input feature dimensionality
              n_outs,                           # output feature dimensionality
              ms_outs,                          # multistream_outs
              hyper_params,                     # hyperparameters for training and architecture
              buffer_size,                      # training buffer size
              plot=False,                       # create plot of (train/dev) training convergence
              var_dict=None,                    # load covariance matrix
              cmp_mean_vector = None,           # cmp == audio features used in HTS training
              cmp_std_vector = None,            # cmp == audio features used in HTS training
              init_dnn_model_file = None):      # DNN model with which we initialize new DNN

Since this step in run_demo.sh produces a lot of output, I decided to record my terminal session and embed it here, for those interested in the details.

After the training is complete, we can see that we’ve produced some new dirs and files in our experiments/duration_model directory:

Here’s our duration_model dir before training:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree -d experiments/
experiments/
└── slt_arctic_demo
    └── duration_model
        └── data
            ├── file_id_list_demo.scp
            ├── label_phone_align
            │   └── arctic_a0001.lab
            └── label_state_align
                └── arctic_a0001.lab

And here’s our duration_model dir after training:

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ tree -d experiments/
experiments/
└── slt_arctic_demo
     └── duration_model
         ├── data
         │   ├── binary_label_416
         │   │   └── arctic_a0001.lab
         │   ├── dur
         │   │   └── arctic_a0001.dur
         │   ├── file_id_list_demo.scp
         │   ├── label_norm_HTS_416.dat
         │   ├── label_phone_align
         │   │   └── arctic_a0001.lab
         │   ├── label_state_align
         │   │   └── arctic_a0001.lab
         │   ├── lf0
         │   ├── nn_dur_5
         │   │   └── arctic_a0001.cmp
         │   ├── nn_norm_dur_5
         │   │   └── arctic_a0001.cmp
         │   ├── nn_no_silence_lab_416
         │   │   └── arctic_a0001.lab
         │   ├── nn_no_silence_lab_norm_416
         │   │   └── arctic_a0001.lab
         │   ├── norm_info_dur_5_MVN.dat
         │   ├── ref_data
         │   │   └── arctic_a0051.dur
         │   └── var
         │       └── dur_5
         ├── gen
         │   └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_1_50_416_5_4_512_512
         │       ├── arctic_a0051.cmp
         │       ├── arctic_a0051.dur
         │       └── arctic_a0051.lab
         ├── log
         │   ├── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_12_04PM_February_16_2017.log
         │   └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_50_259_4_512_0.002000_12_04PM_February_16_2017.log.gitdiff
         └── nnets_model
             └── DNN_TANH_TANH_TANH_TANH_LINEAR__dur_0_4_512_512_512_512_416.5.train.50.0.002000.rnn.model


27 directories

`nnets_model`

I’ll let you go through the different dirs and files, but I will point out the most important file, the trained DNN model.

This new, shiny DNN can be found in the nnets_model dir and, in my case, the filename is:

DNN_TANH_TANH_TANH_TANH_LINEAR__dur_0_4_512_512_512_512_416.5.train.50.0.002000.rnn.model

The filename is long, but it contains some very important information.

activation functions used in our DNN layers: TANH_TANH_TANH_TANH_LINEAR
model type, acoustic or duration: dur
True or False, whether we have a multi-stream model: 0
number of hidden layers: 4
dimensionality of hidden layers: 512_512_512_512
number of input nodes (i.e. the dimensionality of labels (phonemes) in our model): 416
number of output nodes (i.e. the dimensionality of the features to predict, which for the duration model are durations rather than acoustic features): 5
the number of utterances used in training: 50
the learning rate: 0.002000
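
To pull these fields out programmatically, here is a sketch of a parser for the filename. The regular expression is my own reading of the naming pattern based on the two model filenames in this post; Merlin may well name models differently for other configurations.

```python
import re

# Pattern inferred from the filenames shown in this post (an assumption).
MODEL_NAME = re.compile(
    r"^DNN_(?P<activations>[A-Z_]+?)__"
    r"(?P<model_type>[a-z0-9_]+?)_"
    r"(?P<multistream>[01])_"
    r"(?P<n_hidden>\d+)_"
    r"(?P<hidden_sizes>\d+(?:_\d+)*)_"
    r"(?P<n_in>\d+)\.(?P<n_out>\d+)"
    r"\.train\.(?P<n_train>\d+)"
    r"\.(?P<learning_rate>\d+\.\d+)"
    r"\.rnn\.model$"
)


def parse_model_name(filename):
    """Split a trained-model filename into its descriptive fields."""
    match = MODEL_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected model filename: {filename!r}")
    fields = match.groupdict()
    return {
        "activations": fields["activations"].split("_"),
        "model_type": fields["model_type"],
        "multistream": fields["multistream"] == "1",
        "n_hidden": int(fields["n_hidden"]),
        "hidden_sizes": [int(size) for size in fields["hidden_sizes"].split("_")],
        "n_in": int(fields["n_in"]),
        "n_out": int(fields["n_out"]),
        "n_train": int(fields["n_train"]),
        "learning_rate": float(fields["learning_rate"]),
    }


info = parse_model_name(
    "DNN_TANH_TANH_TANH_TANH_LINEAR__dur_0_4_512_512_512_512_416.5"
    ".train.50.0.002000.rnn.model"
)
print(info["model_type"], info["n_in"], info["n_out"], info["hidden_sizes"])
```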

Train the Acoustic Model

The process for training the acoustic model is exactly the same as that for training the duration model, the only difference being the configuration file which is used to initialize and train the DNN.

As you can see from the code snippet below from run_demo.sh, we send a .conf file to run_merlin.py, and then use submit.sh to send the job to our processor(s).

### Step 3: train acoustic model ###
echo "Step 3: training acoustic model..."
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/acoustic_${Voice}.conf

This process takes a lot longer than training the duration model. That’s because our output layer is much larger (187 output nodes instead of 5), so we have more parameters in our DNN to train.

This took about an hour and a half on my laptop to train.
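
To get a feel for the difference in size, we can count the trainable parameters of both networks from the dimensions in their filenames. This is a back-of-the-envelope sketch that assumes plain fully connected layers with one bias per unit; count_parameters is a hypothetical helper, not a Merlin function.

```python
def count_parameters(n_in, hidden_sizes, n_out):
    """Count weights and biases of a fully connected feed-forward network."""
    sizes = [n_in, *hidden_sizes, n_out]
    return sum(
        fan_in * fan_out + fan_out  # weight matrix plus bias vector
        for fan_in, fan_out in zip(sizes, sizes[1:])
    )


hidden = [512, 512, 512, 512]
duration_params = count_parameters(416, hidden, 5)
acoustic_params = count_parameters(425, hidden, 187)
print(duration_params)   # 1004037
print(acoustic_params)   # 1102011
```

Most of the difference sits in the output layer (2,565 versus 95,931 parameters in this sketch). I suspect the amount of data seen per epoch matters for the training time as well, but I haven’t measured that.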

Now if we look into our experiments dir to see what new things we’ve added to the acoustic_model subdir, we see a similar picture to what we saw with the duration_model subdir.

First, let’s look at the acoustic_model subdir before training the DNN:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree acoustic_model/
acoustic_model/
└── data
    ├── bap
    │   └── arctic_a0001.bap
    ├── file_id_list_demo.scp
    ├── label_phone_align
    │   └── arctic_a0001.lab
    ├── label_state_align
    │   └── arctic_a0001.lab
    ├── lf0
    │   └── arctic_a0001.lf0
    └── mgc
        └── arctic_a0001.mgc

And here’s what we get after we run the training script:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree acoustic_model/
acoustic_model/
├── data
│   ├── bap
│   │   └── arctic_a0001.bap
│   ├── binary_label_425
│   │   └── arctic_a0001.lab
│   ├── dur
│   ├── file_id_list_demo.scp
│   ├── label_norm_HTS_425.dat
│   ├── label_phone_align
│   │   └── arctic_a0001.lab
│   ├── label_state_align
│   │   └── arctic_a0001.lab
│   ├── lf0
│   │   └── arctic_a0001.lf0
│   ├── mgc
│   │   └── arctic_a0001.mgc
│   ├── nn_mgc_lf0_vuv_bap_187
│   │   └── arctic_a0001.cmp
│   ├── nn_norm_mgc_lf0_vuv_bap_187
│   │   └── arctic_a0001.cmp
│   ├── nn_no_silence_lab_425
│   │   └── arctic_a0001.lab
│   ├── nn_no_silence_lab_norm_425
│   │   └── arctic_a0001.lab
│   ├── norm_info_mgc_lf0_vuv_bap_187_MVN.dat
│   ├── ref_data
│   │   ├── arctic_a0051.bap
│   │   ├── arctic_a0051.lf0
│   │   └── arctic_a0051.mgc
│   └── var
│       ├── bap_3
│       ├── lf0_3
│       ├── mgc_180
│       └── vuv_1
├── gen
│   └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_1_50_425_187_4_512_512
│       ├── arctic_a0051.bap
│       ├── arctic_a0051.cmp
│       ├── arctic_a0051.lf0
│       ├── arctic_a0051.mgc
│       ├── arctic_a0051.mgc_b0
│       ├── arctic_a0051.mgc_p_b0
│       ├── arctic_a0051.mgc_p_mgc
│       ├── arctic_a0051.mgc_p_r0
│       ├── arctic_a0051.mgc_r0
│       ├── arctic_a0051.wav
│       └── weight
├── log
│   ├── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_12_07PM_February_18_2017.log
│   └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_50_259_4_512_0.002000_12_07PM_February_18_2017.log.gitdiff
└── nnets_model
    └── DNN_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_0_4_512_512_512_512_425.187.train.50.0.002000.rnn.model
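
One detail worth noticing in this tree is where the 187 in the directory and model names comes from: it matches the sum of the per-stream dimensions listed under var. The check below just adds them up. The split into static features plus two sets of dynamic features is my assumption, based on the stream sizes being multiples of three.

```python
# Per-stream output dimensions, read off the var/ directory names above.
stream_dims = {"mgc": 180, "lf0": 3, "vuv": 1, "bap": 3}

total = sum(stream_dims.values())
print(total)  # 187

# Assumption: mgc, lf0 and bap each carry static + delta + delta-delta
# features, so the static dimension is one third of the stream size.
static_dims = {
    name: dim // 3 for name, dim in stream_dims.items() if name != "vuv"
}
print(static_dims)
```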

If you’ve made it to this point, you have completed training and have all you need to start synthesizing some speech!

Synthesize Speech

So let’s start synthesizing some speech now.

Just like in training, we use the main run_merlin.py script for synthesis:

### Step 4: synthesize speech   ###
echo "Step 4: synthesizing speech..."
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/test_dur_synth_${Voice}.conf

Thankfully, this step takes almost no time at all, as you’ll see in the terminal session below. For this demo, we are only synthesizing 10 audio files.

Let’s take a look at what this script has created by comparing our test_synthesis dir before and after running this step:

Before running the synthesis script:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
├── prompt-lab
│   └── arctic_a0001.lab
└── test_id_list.scp

After running the duration synthesis script:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
├── gen-lab
│   ├── arctic_a0001.cmp
│   ├── arctic_a0001.dur
│   └── arctic_a0001.lab
├── prompt-lab
│   └── arctic_a0001.lab
└── test_id_list.scp

Now we run the acoustic synthesis script:

./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/test_synth_${Voice}.conf

This synthesis step takes longer than the duration synthesis step, but it is still relatively quick even on my laptop.

Here’s what we find in the test_synthesis dir after we’ve generated our wav files:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
    ├── gen-lab
    │   ├── arctic_a0001.cmp
    │   ├── arctic_a0001.dur
    │   ├── arctic_a0001.lab
    ├── prompt-lab
    │   ├── arctic_a0001.lab
    ├── test_id_list.scp
    └── wav
        ├── arctic_a0001.bap
        ├── arctic_a0001.cmp
        ├── arctic_a0001.lf0
        ├── arctic_a0001.mgc
        ├── arctic_a0001.mgc_b0
        ├── arctic_a0001.mgc_p_b0
        ├── arctic_a0001.mgc_p_mgc
        ├── arctic_a0001.mgc_p_r0
        ├── arctic_a0001.mgc_r0
        ├── arctic_a0001.wav
        └── weight

Clean up

At this point, we’ve trained our DNNs and used them to generate new speech in the form of WAV files. We could very well stop now and just take our WAV files and do with them whatever we want.

However, in the official Merlin demo, they add an extra step at the end that helps save you some room on your disk.

josh@yoga:~/git/merlin/egs/slt_arctic/s1$ ./run_demo.sh 
Step 5: deleting intermediate synthesis files...
synthesized audio files are in: experiments/slt_arctic_demo/test_synthesis/wav
All successfull!! Your demo voice is ready :)

Here’s our nice, cleaned up test_synthesis dir:

josh@yoga:~/git/merlin/egs/slt_arctic/s1/experiments/slt_arctic_demo$ tree test_synthesis/
test_synthesis/
    ├── gen-lab
    │   └── arctic_a0001.lab
    ├── prompt-lab
    │   └── arctic_a0001.lab
    ├── test_id_list.scp
    └── wav
        └── arctic_a0001.wav
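
Judging from the before and after trees, the clean-up keeps only the .lab files in gen-lab and the .wav files in wav. The demo does this inside run_demo.sh; what follows is just a Python sketch of the same idea, with a hypothetical function name and a dry-run default so nothing gets deleted by accident.

```python
from pathlib import Path


def clean_synthesis_dir(test_synthesis_dir, dry_run=True):
    """Remove intermediate files, keeping gen-lab/*.lab and wav/*.wav.

    Returns the list of paths that were (or, in a dry run, would be) removed.
    """
    root = Path(test_synthesis_dir)
    keep = {"gen-lab": ".lab", "wav": ".wav"}
    removed = []
    for subdir, suffix in keep.items():
        folder = root / subdir
        if not folder.is_dir():
            continue
        for path in sorted(folder.iterdir()):
            if path.is_file() and path.suffix != suffix:
                removed.append(path)
                if not dry_run:
                    path.unlink()
    return removed
```

Calling clean_synthesis_dir("experiments/slt_arctic_demo/test_synthesis") only lists what would go; pass dry_run=False to actually delete the files.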

Conclusion

I hope you found this useful!

If you have issues or comments, leave a comment below.

Resources

Here’s a tutorial on statistical parametric speech synthesis written by Simon King, one of the creators of Merlin.