Building Wolof Text To Speech System | Knowledge 4 All Foundation Ltd.

The objective of this project is to build a Wolof text-to-speech system. Three people will be involved: Thierno Ibrahima DIOP, senior data scientist at Baamtu SARL; El Hadj Mamadou Nguer, Assistant Professor at Universite Virtuelle du Senegal; and Sileye BA, senior machine learning researcher at L’Oreal Innovation Center in Paris. Thierno Ibrahima DIOP and Mamadou Nguer will be the project’s principal investigators.

The project will use a dataset of 40,000 Wolof phrases uttered by two actors. This open-source dataset is a deliverable of a previous project.

The project will be conducted in four phases:
1. Evaluation of the quality of the dataset
2. Implementation of a machine learning model mapping Wolof text to the corresponding utterances
3. Quantitative and qualitative evaluation of the implemented model’s performance
4. Development of an API exposing the implemented text-to-speech model

Dataset quality will be assessed on a random sample of about a thousand uttered phrases. Fluent Wolof speakers will qualitatively validate these phrases for comprehensibility.
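The sampling step could be sketched as follows. This is a minimal illustration, assuming the phrases are addressable by identifiers; the `utt_00000`-style identifiers, the function name and the fixed seed are hypothetical and not taken from the project.

```python
import random

def sample_for_review(phrase_ids, sample_size=1000, seed=0):
    """Draw a reproducible simple random sample of phrase identifiers."""
    rng = random.Random(seed)  # fixed seed so the same sample can be regenerated
    ids = list(phrase_ids)
    k = min(sample_size, len(ids))  # never ask for more items than exist
    return rng.sample(ids, k)  # sampling without replacement

# Hypothetical identifiers standing in for the 40,000 recorded phrases.
all_ids = [f"utt_{i:05d}" for i in range(40000)]
review_set = sample_for_review(all_ids)
print(len(review_set), len(set(review_set)))  # prints: 1000 1000
```

Since the dataset was recorded by two actors, one possible refinement would be to draw the sample separately per speaker so that both voices are represented equally.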

A state-of-the-art neural network speech synthesizer will be implemented and evaluated using the dataset. Neural network models have been selected because they can be trained end to end without the phoneme-level segmentation that competing statistical models require. We will investigate text-to-spectrogram models such as Tacotron, Glow-TTS, and SpeedySpeech, as well as vocoder models such as MelGAN.
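The two model families fit together as a two-stage pipeline: a text-to-spectrogram model produces mel-spectrogram frames, and a vocoder turns those frames into a waveform. The sketch below only illustrates that interface. Both stages are placeholders that emit zeros, not real Tacotron or MelGAN code, and the values `N_MELS = 80` and `HOP_LENGTH = 256` are assumed, commonly used settings rather than project choices.

```python
from typing import Callable, List

MelSpectrogram = List[List[float]]  # frames x mel bins
Waveform = List[float]              # audio samples

N_MELS = 80       # assumed number of mel bins per frame
HOP_LENGTH = 256  # assumed number of audio samples per spectrogram frame

def make_tts_pipeline(
    text_to_mel: Callable[[str], MelSpectrogram],
    vocoder: Callable[[MelSpectrogram], Waveform],
) -> Callable[[str], Waveform]:
    """Compose a text-to-spectrogram stage and a vocoder stage."""
    def synthesize(text: str) -> Waveform:
        mel = text_to_mel(text)  # stage 1: text -> mel spectrogram
        return vocoder(mel)      # stage 2: mel spectrogram -> waveform
    return synthesize

def dummy_text_to_mel(text: str) -> MelSpectrogram:
    # Placeholder: one all-zero frame per input character.
    return [[0.0] * N_MELS for _ in text]

def dummy_vocoder(mel: MelSpectrogram) -> Waveform:
    # Placeholder: HOP_LENGTH silent samples per frame.
    return [0.0] * (len(mel) * HOP_LENGTH)

tts = make_tts_pipeline(dummy_text_to_mel, dummy_vocoder)
audio = tts("Na nga def")  # example Wolof greeting, 10 characters
print(len(audio))          # prints: 2560
```

Keeping the two stages behind separate callables means any of the candidate text-to-spectrogram models can be swapped in without changing the vocoder, and vice versa.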

The trained model will be evaluated quantitatively and qualitatively. The quantitative evaluation will use metrics provided by standard text-to-speech evaluation libraries. The qualitative evaluation will be based on how well fluent Wolof speakers comprehend the synthesized Wolof utterances.
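One way the listener judgments could be aggregated is sketched below. The text does not specify a protocol, so the yes/no judgment format, the strict-majority rule and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

def comprehension_rate(judgments):
    """Fraction of utterances understood by a strict majority of listeners.

    judgments: iterable of (utterance_id, listener_id, understood) tuples.
    """
    votes = defaultdict(list)
    for utt_id, _listener_id, understood in judgments:
        votes[utt_id].append(bool(understood))
    if not votes:
        raise ValueError("no judgments provided")
    # An utterance counts as understood when more than half of its votes are True.
    understood_count = sum(1 for v in votes.values() if sum(v) * 2 > len(v))
    return understood_count / len(votes)

# Hypothetical judgments from three listeners (A, B, C) on four utterances.
judgments = [
    ("utt_1", "A", True),  ("utt_1", "B", True),  ("utt_1", "C", False),
    ("utt_2", "A", False), ("utt_2", "B", False), ("utt_2", "C", True),
    ("utt_3", "A", True),  ("utt_3", "B", True),  ("utt_3", "C", True),
    ("utt_4", "A", True),  ("utt_4", "B", False), ("utt_4", "C", True),
]
print(comprehension_rate(judgments))  # prints: 0.75
```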

The model will be exposed via an API that takes a language token and the input text, and returns the synthesized speech as an audio file. This API will be plugged into a web platform based on the Masakhane MT web platform.
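The request contract described above (language token and text in, audio file out) might look like the following handler. This is a sketch under stated assumptions: the function name `synthesize`, the `"wo"` token (the ISO 639-1 code for Wolof), the 22,050 Hz sample rate and the WAV output format are illustrative, and `placeholder_model` emits a plain tone in place of the trained model.

```python
import io
import math
import struct
import wave

SUPPORTED_LANGUAGES = {"wo"}  # assumed token; "wo" is the ISO 639-1 code for Wolof
SAMPLE_RATE = 22050           # assumed output sample rate in Hz

def placeholder_model(text):
    """Stand-in for the trained model: 50 ms of a 220 Hz tone per character."""
    n_samples = int(0.05 * SAMPLE_RATE) * len(text)
    return [0.3 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE)
            for i in range(n_samples)]

def synthesize(language, text):
    """Validate the request and return the audio as the bytes of a WAV file."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language token: {language!r}")
    if not text.strip():
        raise ValueError("text must not be empty")
    samples = placeholder_model(text)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit PCM
        wav.setframerate(SAMPLE_RATE)
        pcm = (int(s * 32767) for s in samples)  # scale floats to 16-bit integers
        wav.writeframes(struct.pack(f"<{len(samples)}h", *pcm))
    return buffer.getvalue()

wav_bytes = synthesize("wo", "Na nga def")
print(wav_bytes[:4])  # prints: b'RIFF'
```

A web framework route would wrap this function, mapping the two `ValueError` cases to client-error responses and returning the bytes with an audio content type.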

For deployment, a Kubernetes cluster will be used to provide horizontal scaling. In the beginning a single instance may be enough; the number of instances will then be adjusted automatically depending on the load. An instance (8 cores, 32 GB of RAM) will cost about $83.95 per month on a yearly reservation basis.
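The scaling behaviour and its cost can be estimated with a small calculation. The sketch below uses the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), and omits its tolerance and stabilization behaviour. The metric values, the 60% target, the `max_replicas=10` ceiling and the assumption that each replica occupies one instance are illustrative, not project settings.

```python
import math

MONTHLY_COST_PER_INSTANCE = 83.95  # USD per instance, figure quoted in the text

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count from the HPA formula, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

def monthly_cost(replicas):
    """Monthly cost in USD, assuming one instance per replica."""
    return round(replicas * MONTHLY_COST_PER_INSTANCE, 2)

# One instance at 90% average CPU against an assumed 60% target.
scaled_up = desired_replicas(1, current_metric=90, target_metric=60)
print(scaled_up, monthly_cost(scaled_up))  # prints: 2 167.9

# Three instances at 20% average CPU against the same target.
scaled_down = desired_replicas(3, current_metric=20, target_metric=60)
print(scaled_down, monthly_cost(scaled_down))  # prints: 1 83.95
```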

An objective of this project is to publish the work done on the dataset and the developed speech synthesis model at a natural language processing venue such as the African NLP Workshop or the Deep Learning Indaba. This will give more visibility to this work and, at the same time, advance machine-learning-based African language processing activities.