How we added Kyrgyz to Mozilla's Common Voice project

Introduction

Mozilla is helping build Kyrgyz voice technology (for free) via its Common Voice project. Anyone can go to the Common Voice website and record sentences for the project. We need as many speakers and accents as possible in order to create robust technologies. Donate your voice now.

What is Common Voice?

Common Voice is a data collection project from Mozilla, focused on collecting free and open-source data for speech recognition systems. To build a working speech recognition system (such as Siri or OK Google) the developer must first train the computer to understand how words in a language are pronounced. The system must be able to distinguish sounds like vowels and consonants, and to accomplish this you need lots of audio data.

Mozilla is crowd-sourcing this data collection with the Common Voice project, which allows anyone to record their voice and upload it to the cloud. At any time, developers can download these collections of recordings and train their own speech technologies for any kind of application. For instance, Google could use this data to create a Kyrgyz-language voice assistant for Android phones, or Namba taxi could use this data to make a voice-powered iPhone app for Kyrgyz speakers. Anyone can download the data and use it for whatever project they wish!

Why is this important?

Adding Kyrgyz to Mozilla’s Common Voice project is important because it will spur innovation in the Kyrgyz technology sector. Voice technologies are often more comfortable to use than typing, and for many people (such as people who are blind or have other disabilities) these technologies are essential to daily life. Voice technologies are already widely used for European languages such as English, French and Spanish because the appropriate datasets (i.e. large collections of voice recordings) already exist.

However, the time and money required to create one of these datasets are a major hindrance for a new language. Mozilla has developed a method of crowd-sourcing that makes data collection much faster and free of charge. Mozilla provides all the recordings under the Creative Commons CC-0 license, so the data is free to use for any purpose. This open license ensures that small companies have access to the same cutting-edge technologies as technology giants like Google.

How did we do it?

We took two main steps to add Kyrgyz to the Common Voice project:

Translate the user interface and information into Kyrgyz
Collect text sentences which users will read out loud

The Common Voice team itself, led by Michael Henretty and Kelly Davis, pushed the Kyrgyz language version into production. In addition, Francis Tyers (computational linguist and language activist) aided in team coordination and translation oversight. Josh Meyer processed the text from Kloop.kg and helped coordinate the various teams.

Translation

In order to translate the user interface into Kyrgyz, a team of contributors worked together to ensure that the translations are natural-sounding and accurate. This team includes Chorobek Saandanbek (director of the Bizdin Muras Foundation), Saikal Maatkerimova (Kyrgyz language instructor at Lingua Yurt), and Talgat Subanaliev (recent AUCA graduate). As the interface evolves and expands, Saandanbek will lead the team to ensure that the translation is accurate and up-to-date.

Mozilla has enabled crowd-sourcing of the interface translation itself, such that anyone can propose a new translation if they identify an error in translation. Kyrgyz speakers can help with Common Voice translation via the Pontoon system. Once a new translation is proposed, the team leader (i.e. Saandanbek) will review and accept or defer the translation. In this way, problems are found quickly and resolved appropriately.

Text Collection

To create a dataset for training speech technologies, collecting voices isn’t enough. We need to know what was said in every recording so that the computer can learn to recognize words from the audio. To that end, Mozilla has devised a system that displays a text sentence on the screen for the speaker to read out loud, so that each recording is saved along with its text. These sentences are difficult to find, because they must be under the Creative Commons CC-0 license so that Mozilla may freely distribute the text sentences and audio recordings together.

Currently, all Kyrgyz text sentences used for this project come from the well-known Kyrgyz-language news source Kloop.kg. The founder of Kloop.kg, Bektour Iskender, a proponent of an open internet and the Creative Commons, allowed Kyrgyz-language articles from Kloop to be distributed under CC-0. As such, when users read a sentence for Kyrgyz Common Voice, they are actually reading news from Kloop.kg. This is a major win for the Kyrgyz language and the open internet, because finding CC-0 text for Common Voice is typically the most difficult task in adding a new language. At least 5,000 different sentences are needed for the initial recordings, yet most books and online news (such as BBC Kyrgyz) are not available under CC-0.

After the text was automatically downloaded from Kloop (via this Python script), it was cleaned (all foreign words, numbers, and abbreviations were removed) and sentences of an appropriate length were selected. Ideally, each recording should be about 5 seconds long. More text can be added later to increase the diversity of the sentences read. Diversity is important for Common Voice, because good speech technologies should recognize the speech of people speaking with different accents about different topics.
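
To give a feel for what this kind of cleaning can look like, here is a minimal sketch in Python. It is not the script that was actually used for Kloop. The function names, the word-count limits (a rough stand-in for "about 5 seconds"), and the choice to drop a whole sentence rather than edit it are all assumptions made for illustration. Note also that a character filter like this only catches foreign words written in a non-Cyrillic script; Russian words, for example, would pass through.

```python
import re

# The 36 letters of the Kyrgyz Cyrillic alphabet (lowercase).
KYRGYZ_LOWER = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"
# Letters in both cases plus a few basic punctuation marks (assumed set).
ALLOWED_CHARS = set(KYRGYZ_LOWER + KYRGYZ_LOWER.upper() + " ,.!?-")

# Hypothetical limits: a rough proxy for a recording of about 5 seconds.
MIN_WORDS = 3
MAX_WORDS = 14


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def is_clean(sentence):
    """Return True if the sentence has no digits, Latin letters or other
    disallowed symbols, no all-caps tokens, and an acceptable length."""
    if not all(ch in ALLOWED_CHARS for ch in sentence):
        return False
    words = sentence.split()
    for word in words:
        core = word.strip(",.!?-")
        # Crude abbreviation check: two or more letters, all uppercase.
        if len(core) >= 2 and core.isupper():
            return False
    return MIN_WORDS <= len(words) <= MAX_WORDS


def select_sentences(text):
    """Keep clean sentences in their original order, without duplicates."""
    seen = set()
    selected = []
    for sentence in split_sentences(text):
        if is_clean(sentence) and sentence not in seen:
            seen.add(sentence)
            selected.append(sentence)
    return selected
```

Calling select_sentences on the text of an article returns only the sentences that pass every check. Dropping a problematic sentence entirely is the simpler design: removing a number or abbreviation from the middle of a sentence could leave something ungrammatical for the speaker to read.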

Donate Your Voice!

In order for quality technologies to be created for the Kyrgyz language, we need more voices!

Anyone can record and donate sentences for the Common Voice project, and the more voices we get, the more accurate the technology becomes.

Donate your voice today!