## Introduction

This post is a short introduction to installing and using the Merlin Speech Synthesis toolkit.

In the following, I will display all the commands needed to (1) install Merlin from the official GitHub repository as well as (2) run the included demo.

In addition to showing and explaining the commands needed to install and run Merlin, I will also take some time to show how the scripts work and dive into the file structure expected by Merlin. However, I won’t get into any of the algorithms behind DNNs or speech synthesis.

More specifically, I will show which files are required by scripts, which files are generated by scripts, and how the main demo script run_demo.sh proceeds from data preparation to training to synthesis.

Here is a nice, concise description of the toolkit quoted directly from the official CSTR Merlin site:

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).

The system is written in Python and relies on the Theano numerical computation library.

Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the-art systems.

Merlin is free software, distributed under an Apache License Version 2.0, allowing unrestricted commercial and non-commercial use alike.

## Installation

### Clone

Like good open-source software, the Merlin toolkit is hosted on GitHub and can be easily downloaded (cloned) with a single line of code:

Easy as that! Now let’s take a peek into what we just downloaded.

We can immediately see some familiar file and directory names, like doc for documentation, egs for examples, src for source code, etc.

First things first, we should always start investigating new code with the README file:

This main README file gives us information on the toolkit, how to install it, how to work with it, and how to contact the team from the University of Edinburgh. For our current purposes, since we’re just trying to install and run the demo, let’s take a look into the INSTALL file.

So, we find that to install Merlin successfully, we have to:

1. Follow the instructions in the tools/INSTALL file
2. Install some Python dependencies

### Compile

First we cd into the tools dir:

Then we take a look into the new INSTALL file we find here:

Simple as that! We just have to run that one compile_tools.sh script and we should be good to go (apart from the Python dependencies).
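For reference, the whole compile step, run from inside the cloned merlin dir, is just:

```shell
cd tools
./compile_tools.sh
```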

So, when we run that script, we see something like this:

Since the output to the terminal was pretty long, I cut out a big part in the middle as you can see.

### Install Python Dependencies

So now that we have compiled our Merlin tools, let’s move on to the second step in the main merlin/INSTALL file, where we are told to make sure we have the right Python dependencies installed. It’s not stated explicitly, but I’m fairly sure we need Python 2 rather than Python 3. In any case, I’ve tested this with Python 2 on my system and it works fine.

So, let’s get all the dependencies in one fell swoop with the trusty pip program, which helps us install Python packages securely and easily. You probably already have it installed; if not, the official pip documentation has simple installation instructions.

We can install all the dependencies with one command:
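From memory of the merlin/INSTALL file, the dependencies are numpy, scipy, matplotlib, lxml, theano, and bandmat; double-check your copy in case the list has changed:

```shell
pip install numpy scipy matplotlib lxml theano bandmat
```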

Since I had all these packages installed already, I get the following messages to the terminal:

So, at this point we have downloaded and compiled Merlin and installed all its necessary Python dependencies!

Now let’s go on to run the demo.

## Run the Demo

As per the main merlin/README.md file, we are invited to begin our demo with example slt_arctic.

So, let’s cd into that directory and take a look around.

First things first, let’s look at what the README has to tell us about this example.

So, we’re going to start with the simpler demo voice as suggested.

### Pre-existing Dirs & Files

Before we just go and run that run_demo.sh, let’s investigate what data and scripts we have so far so that we can get an idea of what Merlin requires and what the workflow is.

I always like to use the tree program for Linux for looking at file structures. It’s a very simple program that will list the contents of a directory recursively.

When we look at the file structure for the slt_arctic/s1 dir, before we’ve run any scripts, we see this:
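The command itself, run from the top-level merlin dir, is simply:

```shell
cd egs/slt_arctic/s1
tree
```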

So, what should we take away from the files and directories present?

#### conf

The first dir we find in slt_arctic/s1 is conf (short for configuration).

The conf dir contains one configuration file for logging (logging_config.conf), as well as one dir (dnn) which contains DNN configuration files.

In the conf/dnn dir, the DNN “conf” files define some information about the paths to relevant directories, information about the training data, and the architecture of the DNNs we want to train.

There are a total of four (4) DNN configuration files, the first two for training and the last two for testing:

In Merlin, we don’t just model the acoustics of phonemes of the language. We also model their durations. For both (1) acoustic modeling and (2) duration modeling, we use DNNs, and as such we have two configuration files as seen in the conf/dnn dir above.

Quoting from the team’s demonstration paper, they concisely describe the duration model as such:

Duration modelling: Merlin models duration using a separate DNN to the acoustic model. The duration model is trained on the aligned data, to predict phone- and/or state-level durations. At synthesis time, duration is predicted first, and is used as an input to the acoustic model to predict the speech parameters.

As such, we need two DNNs, and that means we need two configuration files.

#### scripts

Moving on to the second directory within s1, we find the location of our data preparation scripts.

Logically, this directory is labeled scripts.

Let’s take a look inside:

The first four data preparation scripts in the scripts dir have very transparent filenames, so I won’t elaborate on them.

However, we should note that the main purpose of the setup.sh script is to:

1. create the main directories to house the current experiment
2. move test and train data into those directories
3. define the global configuration file

Moving on, the last script located in the s1/scripts/ dir is submit.sh. The name is transparent once you know that this script will take any Theano job and submit it to either a GPU or CPU, depending on what you have available.

#### testrefs

Moving back up a level to the s1/ dir, the last of the three main dirs is testrefs. This dir contains only four (4) files, which are all log files from training performed by the CSTR team. These files can be used to compare against our own training in case we hit any problems.

At this point, we’ve gone over the content in each of the three dirs in the first level of the main s1 dir.

Specifically, we mentioned:

1. s1/conf: contains configuration files for building, training, and testing our DNNs
2. s1/scripts: contains scripts for preparing data and submitting Theano jobs
3. s1/testrefs: contains log files from the CSTR team for our reference

### Run the run_demo.sh Script

Now that we’ve gone over our pre-existing dirs, let’s go on to the run script and start running the demo.

The s1/README.md file directs us to run the run_demo.sh script first, so I will start there.

So, walking through the run_demo.sh script, the first thing we see is three scripts in the data prep stage:
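In outline, the three calls look like this (reconstructed; the exact arguments may differ in your copy of run_demo.sh):

```shell
# Data preparation stage
./scripts/setup.sh slt_arctic_demo
./scripts/prepare_config_files.sh conf/global_settings.cfg
./scripts/prepare_config_files_for_synthesis.sh conf/global_settings.cfg
```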

If we take the time to run these scripts one-by-one, we can more easily see what’s going on. I’m going to run each script here, and look into what’s being output by each one.

#### Prepare Data & Create Config File

setup.sh does two important things:

1. download, unzip, and move the training and testing data into a new experiments dir
2. create the global configuration file, global_settings.cfg, in our conf dir

So, beginning by running only the setup.sh script (from within run_demo.sh), we get the following output to the terminal:

After running setup.sh, if we look into the file structure of the s1 dir (for just the first level, hence the -L 1 flag), we can see we’ve created a few new things.

I’ve put asterisks (**) around the newly created dirs and file:

Specifically, we have created the following:

1. experiments dir
2. slt_arctic_demo_data dir
3. slt_arctic_demo_data.zip compressed file

What happened was the following:

1. the relevant data files were downloaded from Srikanth Ronanki’s homepage, hence the slt_arctic_demo_data.zip file
2. this compressed file was uncompressed and saved as slt_arctic_demo_data
3. the relevant files were copied from slt_arctic_demo_data into the new experiments directory

As such, at this point the experiments dir contains mostly just extracted audio feature files.

Here’s what we find in this new experiments dir:

In the above output I’ve omitted all feature files except for arctic_a0001 because there are a lot of them.

Specifically, there are 430 feature files in total.

Since there’s a lot of stuff going on in the experiments dir, I think it’s worth the time to briefly explain what we have downloaded in terms of audio feature, label, and utterance ID files.

#### File Formats in experiments Dir

In terms of the file formats, we find the following:

1. *.bap: band aperiodicities
2. *.lab: label files (time-to-phone alignments)
3. *.lf0: log-fundamental frequencies
4. *.mgc: mel-generalized cepstral coefficients
5. *.scp: script file for filenames

#### *.bap

The first file type, *.bap, is a kind of feature extracted from the audio, and we have one file for every audio file in our data set. We look into the *.bap file itself with the help of the x2x program which is included in Merlin via SPTK (Speech Signal Processing Toolkit). In the following command we use +fa to convert our binary floats (“f”) to ASCII (“a”).

The only reason I’m using head and tail here is to get a section of the file which has numbers other than zeros, so it’s more interesting.
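If you don’t have SPTK’s x2x on your PATH (it lives inside the Merlin tools tree), you can mimic the idea with the standard od program. Here I write a hypothetical one-frame feature file by hand, a single little-endian float32, and dump it as ASCII, which is what x2x +fa does:

```shell
# One little-endian float32 with value 1.0 (bytes 00 00 80 3f,
# i.e. octal 000 000 200 077)
printf '\000\000\200\077' > one_frame.bap

# Dump the binary floats as ASCII, analogous to x2x +fa
od -t f4 one_frame.bap
```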

#### *.lab

The second file type in the experiments dir is *.lab. These files are the “label” files which contain alignments for either phones or states to our audio files from our data set. We have two kinds of files here: (1) for phoneme alignments, and (2) for state alignments.

Quoting from email communication with Srikanth Ronanki (one of the Merlin developers):

To train a voice using Merlin, you either require phone-level or state-level alignments but not both. This option is configurable in global_settings.cfg.

Only very few international languages have enough resources to derive state-level alignments using HMMs – therefore Merlin is configured to even run with phone-level alignments which can be manually-labelled. But, the default demo runs using state-level alignments.

First, for phoneme alignments, we see something like this:

For state-level alignments, we get the following:

For a little sanity check, we can see that the two *.lab files for the same audio file have a different number of lines.

Each line in a *.lab file represents a single alignment in time. As such, we would expect that for a given audio file, its state alignment .lab file will have more lines than its phoneme .lab file. This is because one phoneme is made up of multiple states in HMM-based ASR (i.e., the machinery used to generate alignments).

We’d have to look more into the details of the acoustic model used to do forced alignment to generate the labels, but the difference between the numbers of lines in the alignment files is around what we’d expect:

In this case, for the utterance arctic_a0001 we have a total of 37 phonemes and 185 states, which comes out to 185/37 = 5 states per phoneme.
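To make the format concrete, here is a toy phone-level label file in the same spirit (real Merlin labels carry full HTK-style context strings rather than bare phone names; the times are in HTK’s 100 ns units):

```shell
# Toy label file: "start end phone", one alignment per line
cat > toy_phone.lab <<'EOF'
0 2050000 pau
2050000 3550000 hh
3550000 4750000 ax
EOF

# Convert each duration from 100 ns units to milliseconds
awk '{printf "%-4s %d ms\n", $3, ($2 - $1) / 10000}' toy_phone.lab
```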

#### *.lf0

The *.lf0 files are the log-fundamental frequency files, aka, another kind of feature file extracted from our audio files in our data set.

Again here, we’re using the x2x program to get something human-readable from the feature file:

I’ve deleted a bunch of empty lines in the above output so as not to take up so much space.

#### *.mgc

Next, we move on to our next feature file type: *.mgc. These files contain the mel-generalized cepstral coefficients for our audio files in our data set. Again, this is not very human-readable:

#### *.scp

Moving on to the next file type, the *.scp files here, I’m guessing, are similar to Kaldi’s *.scp files. These files are “script” files which typically contain lists of information. In this case, the experiments dir contains two *.scp files which contain lists of file ids. For example:
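Here’s a sketch of how such a list gets used; the ids are made up to match the demo’s naming, and the loop stands in for whatever a given stage does per utterance:

```shell
# Toy file id list
cat > file_id_demo.scp <<'EOF'
arctic_a0001
arctic_a0002
arctic_a0003
EOF

# Each stage iterates over the ids and picks up the matching
# feature/label files (e.g. <id>.lf0, <id>.mgc, <id>.lab)
while read -r file_id; do
    echo "would process ${file_id}"
done < file_id_demo.scp
```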

Now that we’ve gone through the experiments dir and the format of its file contents, here’s a short recap of where we are:

We’ve just run the setup.sh script. This script downloaded, unzipped, and formatted the data we need for training and testing. The data we need includes (1) various audio feature files, (2) label files, and (3) utterance lists.

#### global_settings.cfg

We shouldn’t forget that the setup.sh script created a new, global configuration file in the conf dir.

This new configuration file is called global_settings.cfg.

This global config file (conf/global_settings.cfg) will contain information on where the compiled Merlin programs are located, where the current working dir is, what kind of Vocoder we’re using, and how many files to use for training and testing.
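I won’t reproduce my full file, but the gist is a handful of key-value pairs along these lines (the key names and values here are illustrative, not copied from a real file):

```
MerlinDir=/path/to/merlin
WorkDir=/path/to/merlin/egs/slt_arctic/s1
Vocoder=WORLD
Train=50
Test=10
```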

That’s all for setup.sh!

### Prepare Training Config Files

Now we’re going to run just the prepare_config_files.sh script from within run_demo.sh, just like we did above with setup.sh, and take a look at what it does.

This script produces two configuration files for training:

1. duration_slt_arctic_demo.conf
2. acoustic_slt_arctic_demo.conf

When we run the script, we only get two lines of output to the terminal:

Sure enough, as promised by our two above messages, when we look into our pre-existing conf directory, we find two new configuration files:

These two new configuration files are of the same form as the original configuration files we found pre-existing when we cloned Merlin from GitHub.

They define architecture and training procedure for the acoustic model DNN and the duration model DNN.

Pretty straightforward overall.

### Prepare Testing Config Files

This section deals with the prepare_config_files_for_synthesis.sh script called from run_demo.sh.

In the previous script we created the configuration files for our training procedure; now we do the same for our testing (aka synthesis) phase.

Just like before, we get two lines of output to the terminal.

Also, just like before, we have two new files in our conf dir:

### Train the Duration Model

If you’ve gotten here and followed all the previous steps, we’re ready to finally start training our DNNs… huzzah!

You’ll see in the run_demo.sh script that after the data preparation phase is completed, we send a Theano job to our processing unit (CPU or GPU) and we specify that we want to train the duration model:
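The relevant call looks like this (reconstructed; the variables are filled in from conf/global_settings.cfg):

```shell
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/duration_${Voice}.conf
```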

As you can see, what happens here is that the main Merlin run script run_merlin.py is called and given its argument, duration_${Voice}.conf, a configuration file for the duration DNN model.

This run_merlin.py script is then fed into another script, submit.sh, where it is processed as a job and sent to your computer’s CPU(s) or GPU(s).

As such, the main script of interest in this training step is run_merlin.py.

This is a very clearly written Python script, and I invite you to take a look yourself into the code. Here I will merely show you the main train function and its arguments and defaults:

Since this step in run_demo.sh produces a lot of output, I decided to record my terminal session and embed it here, for those interested in the details.

After the training is complete, we can see that we’ve produced some new dirs and files in our experiments/duration_model directory:

Here’s our duration_model dir before training:

And here’s our duration_model dir after training:

#### nnets_model

I’ll let you go through the different dirs and files, but I will point out the most important file, the trained DNN model.

This new, shiny DNN can be found in the nnets_model dir and, in my case, the filename is:

The filename is long, but it contains some very important information.

1. activation functions used in our DNN layers: TANH_TANH_TANH_TANH_LINEAR
2. model type, acoustic or duration: dur
3. True or False, whether we have a multi-stream model: 0
4. number of hidden layers: 4
5. dimensionality of hidden layers: 512_512_512_512
6. number of input nodes (i.e. the dimensionality of labels (phonemes) in our model): 416
7. number of output nodes (i.e. the dimensionality of acoustic features to predict): 5
8. the number of utterances used in training: 50
9. the learning rate: 0.002000

### Train the Acoustic Model

The process for training the acoustic model is exactly the same as that for training the duration model, the only difference being the configuration file which is used to initialize and train the DNN.

As you can see from the code snippet below from run_demo.sh, we send a .conf file to run_merlin.py, and then use submit.sh to send the job to our processor(s).
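The snippet is essentially the duration-model call with the acoustic configuration file swapped in (reconstructed; check your copy of run_demo.sh):

```shell
./scripts/submit.sh ${MerlinDir}/src/run_merlin.py conf/acoustic_${Voice}.conf
```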

This process runs a lot longer than training the duration model. That’s because our output layer is much larger, and we have a lot more parameters in our DNN to train.

This took about an hour and a half on my laptop to train.

Now if we look into our experiments dir to see what new things we’ve added to the acoustic_model subdir, we see a similar picture to what we saw with the duration_model subdir.

First, let’s look at the acoustic_model subdir before training the DNN:

And here’s what we get after we run the training script:

If you’ve made it to this point, you have completed training and have all you need to start synthesizing some speech!

### Synthesize Speech

So let’s start synthesizing some speech now.

Just like in training, we use the main run_merlin.py script for synthesis:

Thankfully, this step takes almost no time at all, as you’ll see in the terminal session below. For this demo, we are only synthesizing 10 audio files:

Let’s take a look as to what this script has created by comparing our test_synthesis dir before and after running this step:

Before running the synthesis script:

After running the duration synthesis script:

Now we run the acoustic synthesis script:

This synthesis step takes longer than the duration synthesis step, but it is still relatively quick even on my laptop.

Here’s what we find in the test_synthesis dir after we’ve generated our wav files:

### Clean up

At this point, we’ve trained our DNNs and used them to generate new speech in the form of WAV files. We could very well stop now and just take our WAV files and do with them whatever we want.

However, in the official Merlin demo, they add an extra step at the end that saves you some room on your disk.

Here’s our nice, cleaned up test_synthesis dir:

## Conclusion

I hope you found this useful!