Ossian is a collection of Python code for building text-to-speech (TTS) systems. We’ve designed Ossian to make research on TTS more efficient and at the same time more attainable for newcomers to TTS. With Ossian you can quickly script your ideas and run experiments, spending more time working on TTS and less time worrying about all the rest.
Ossian supports the use of neural nets trained with the Merlin toolkit as duration and acoustic models. You can also use the HTS toolkit to build HMM-GMM models.
Work on Ossian started with funding from the EU FP7 Project Simple4All.
All comments and feedback are very welcome! Make a fork, add to the code and send us a pull request!
Dependencies
Before we can get started with Ossian, let’s download and install its dependencies. Thankfully, there aren’t too many, and they’re all very easy to get working fast.
HTK
Ossian requires HTK to generate the alignments for training your DNN in Merlin. To get access to HTK, you need to register an account with the University of Cambridge. Thankfully, registration is really easy! Just go to this link:
http://htk.eng.cam.ac.uk/register.shtml
Fill out the registration form and accept the terms of the license. You will then receive your password in a very short email from htk-mgr@eng.cam.ac.uk.
The password you receive will be some random string, so if you prefer a password that isn’t gibberish, you can easily change it here: http://htk.eng.cam.ac.uk/change_pass.shtml.
At this point, don’t worry about downloading HTK yourself. Soon, we will take your beautiful new username and password and feed them into Ossian’s installation script. The installation script is designed to take care of all the downloading, compilation and formatting for HTK, so you don’t have to! This means less time worrying about file structure and more time working on speech synthesis!
Compilation Dependencies
These are some dependencies you may need for compiling Ossian on Linux:
Python Dependencies
Next we will install some Python dependencies:
That’s it! You should be good to go on dependencies.
Install Ossian
Now we’re ready to install Ossian itself! Huzzah! First let’s clone Ossian from GitHub, then compile it.
Clone Ossian
Compile Ossian
Our configuration and compiling will be done by the ./tools/setup_tools.sh script. This script takes two arguments, HTK_USERNAME and HTK_PASSWORD.
In fact this one script will download and compile both HTK and Merlin in addition to Ossian.
This script has compiled Ossian, cloned and compiled Merlin, downloaded and compiled HTK, and put everything in the tools directory. At this point, if you didn’t run into any problems, you should have a working installation of Ossian which can call both Merlin and HTK.
Run the Demo
The CSTR folks have been nice enough to provide a small corpus for the Romanian language. To run the Ossian demo, we will download this corpus, which consists of audio files and their transcripts.
Get Some Data
First let’s download the CSTR corpus:
Now we need to unpack the compressed corpus:
Now let’s take a look at the file structure of this Romanian corpus. When building a voice for your own language, your file structure should be exactly the same.
I’m only showing a few audio (*.wav) and text (*.txt) files here, but you get the idea. Each utterance and its transcript share the same base filename: we have an adr_diph1_001.wav as well as an adr_diph1_001.txt. If you have mismatches or missing files, you will have issues later on in training.
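Before training, it can be worth checking that every audio file really does have a matching transcript. The following is a minimal sketch of such a check, not part of Ossian: the helper name find_unpaired is hypothetical, and you pass in whichever two directories hold your *.wav and *.txt files.

```python
from pathlib import Path


def find_unpaired(wav_dir, txt_dir):
    """Return (names with audio but no transcript, names with transcript but no audio)."""
    # Compare base filenames, e.g. "adr_diph1_001" for adr_diph1_001.wav / .txt.
    wav_names = {p.stem for p in Path(wav_dir).glob("*.wav")}
    txt_names = {p.stem for p in Path(txt_dir).glob("*.txt")}
    missing_txt = sorted(wav_names - txt_names)
    missing_wav = sorted(txt_names - wav_names)
    return missing_txt, missing_wav
```

If both returned lists are empty, every utterance is paired. Anything that shows up in either list should be fixed or removed before you start training.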
For this demo, remember that we’re working with the Romanian language, hence the main directory label rm for “Romanian”. One level down we have a speakers directory and a text_corpora directory. We can have multiple speakers (i.e. voices) per language, so you can imagine having multiple subdirs under speakers/*, one for each voice or corpus. Here the speaker dir we’re working with is labeled rss_toy_demo, but it could easily be called mary or john for the actual speaker’s name.
Next, we have a text corpus directory called text_corpora. We could have multiple text corpora (just like we could have multiple speakers), but in our case we’re just working with a sample of Romanian Wikipedia, hence the name wikipedia_10k_words.
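If you are starting a corpus for a new language, the layout described above can be created with a few lines of Python. This is only a sketch under stated assumptions: make_corpus_skeleton is a hypothetical helper, the root directory is whatever location you keep your corpora in, and it only creates the speakers and text_corpora levels named in this post. Copy the arrangement of files inside the speaker directory from the demo corpus.

```python
from pathlib import Path


def make_corpus_skeleton(root, language, speaker):
    """Create <root>/<language>/speakers/<speaker> and <root>/<language>/text_corpora."""
    language_dir = Path(root) / language
    speaker_dir = language_dir / "speakers" / speaker
    text_corpora_dir = language_dir / "text_corpora"
    # exist_ok=True makes the helper safe to re-run on an existing corpus.
    for directory in (speaker_dir, text_corpora_dir):
        directory.mkdir(parents=True, exist_ok=True)
    return speaker_dir, text_corpora_dir
```

For the demo, the equivalent call would use "rm" as the language and "rss_toy_demo" as the speaker.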
So, now that we’ve downloaded our data and taken a look, let’s use it to make a speech synthesizer!
Train Ossian Model
Ossian’s main train.py script takes three arguments:
the speaker dir name after the -s flag: -s speaker_dir
the language dir name after the -l flag: -l language_dir
the recipe configuration file name (without extension): naive_01_nn
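To make the three arguments concrete, here is a small Python sketch that assembles the training call for the demo. Treat the details as assumptions rather than Ossian’s documented interface: build_train_command is a hypothetical helper, the script location ./scripts/train.py and the .cfg recipe extension are guesses you should adjust to your checkout, and the helper only builds the command without running it.

```python
import shlex


def build_train_command(speaker, language, recipe, script="./scripts/train.py"):
    """Assemble the train.py call as an argument list (hypothetical helper)."""
    # The recipe is given by name only, so drop a trailing ".cfg" (assumed extension).
    if recipe.endswith(".cfg"):
        recipe = recipe[: -len(".cfg")]
    # -s takes the speaker dir name, -l the language dir name; the recipe comes last.
    return ["python", script, "-s", speaker, "-l", language, recipe]


command = build_train_command("rss_toy_demo", "rm", "naive_01_nn")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be handed to subprocess.run(command, check=True) from the top of your Ossian directory.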
Train Merlin Model
To train fancy-dancy DNNs for our speech synthesizer, we can use Merlin (built on top of Theano).
We’re going to train both an acoustic and a duration model here. Since this is an Ossian demo and not a Merlin one, we aren’t going to go into detail on Merlin, but if you’re interested, here’s a beginner’s Merlin walkthrough.
Just run the following commands:
Store Merlin Model
Now we will take the Merlin DNNs we just made and format them for Ossian. NB - if you trained your DNNs on a GPU machine, they can only be used on a GPU machine.
We call the conversion script with:
the same config file you used for training: config.cfg
the directory name for newly formatted model: acoustic_predictor or duration_predictor
Synthesize New Audio
Now you’ve got everything in place to synthesize some speech with Ossian! We can use a sample Romanian sentence (text) provided by CSTR to synthesize a sample like so:
And there you go! You can listen to your beautiful new Romanian speech in the file ./test/wav/romanian_test.wav.
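If you are working on a remote machine and can’t listen right away, a quick sanity check is to read the file’s header and confirm it contains a plausible amount of audio. The sketch below uses Python’s standard wave module and assumes the output is an uncompressed PCM WAV file. wav_summary is a hypothetical helper, not part of Ossian.

```python
import wave


def wav_summary(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        # Duration is the number of frames divided by frames per second.
        duration = wav_file.getnframes() / float(sample_rate)
    return sample_rate, channels, duration
```

Calling wav_summary("./test/wav/romanian_test.wav") should give a duration of a few seconds for a single sentence. An error or a zero duration suggests that synthesis did not complete.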
Conclusion
That was pretty easy, yeah? The goal of Ossian is to make good research and development easier, so you can quickly test out more TTS systems. If you have ideas on how to make Ossian better, let us know :)
Resources
Here’s a tutorial on statistical parametric speech synthesis written by Simon King, one of the creators of Merlin.