Keyword spotting is the task of learning to detect spoken keywords. It is the entry point to the voice-based virtual assistants on the market, such as Amazon’s Alexa, Apple’s Siri, and the Google Home device. Unlike speech recognition models, keyword spotting models do not run in the cloud but directly on the device. This naturally constrains their size, energy consumption, and compute cost, because such devices often have limited memory and computing power.
A prerequisite for performing this task with African languages is the creation of a dedicated dataset. African languages account for 30.15% of the 7111 living languages (Orife et al. 2020) and are therefore highly diverse. Unfortunately, they are barely represented in natural language processing (NLP) research.
The motivation of this proposal is to extend the Speech Commands dataset (Warden 2018) with African languages. In particular, we are going to focus on six Senegalese languages: Wolof, Poular, Sérère, Mandingue, Diola, and Soninké. These are the most widely spoken languages in Senegal, and they were the first to be codified (given a written form), in 1971. Two cases must be distinguished in how these languages relate to their dialects. In the first case, that of Wolof, Poular, Mandingue, and Soninké, there are many spoken dialects, each belonging to a particular region of Senegal.
In this first case, speakers of different dialects of the same language can still understand each other. In other words, the language is commonly understood by all speakers of its regional dialects. In the second case, that of Sérère and Diola, regional dialects differ so dramatically that speakers of different dialects of the same language do not understand each other. For these two languages, we therefore chose the most widely spoken dialect: Sérère-Sine and Diola-Fogny.
Goals and outcomes
The Speech Commands dataset (Warden 2018) included only a limited vocabulary composed of around twenty common words at its core. These included the digits from zero to nine, and seventeen words that would be useful as commands in IoT or robotics applications: “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, “Go”, “Backward”, “Forward”, “Bed”, “Bird”, “Cat”, “Dog”, and “Tree”. This dataset was collected specifically for keyword spotting. In short, the objective of this proposal is three-fold:
- To produce open-source software dedicated to collecting data in an African context.
- To build up an open-source extension of the Speech Commands dataset with Senegalese languages.
- To document the dataset creation process and publish it, so that it can be reproduced in other African countries.
Concept and methodology
The first goal of the project is to produce open-source software dedicated to collecting speech data in the African context. To do that, we are going to extend Common Voice so that it can play the role of Amazon Mechanical Turk in an African context. Common Voice is open-source software started by Mozilla to create a free database for speech recognition software. It natively allows users to create an account and to upload a recording of themselves reading a given text.
The account helps us ensure that every recording comes from a different person. It also allows for peer-to-peer validation, which could help with scaling. Amazon Mechanical Turk allows researchers to collect data from the internet by paying users a small amount of money, but it is closed source and only works with traditional payment solutions such as credit cards or PayPal. Mobile money, however, is pervasive in most African countries. Thus, the main challenge for this deliverable is to connect Common Voice to mobile money solutions in Senegal. The end goal is to allow open-source contributions that integrate other mobile money providers in Africa, such as M-PESA.
For that purpose, we have established a partnership with Baamtu. As a technical partner, they are going to take care of all costs related to the development and deployment of the web application. Thierno Diop, a member of GalsenAI and Lead Data Scientist at Baamtu, won the AI4D-African Language Dataset Challenge in November 2019 with this methodology.
The second deliverable is to collect a dataset composed of 1000 recordings of the vocabulary in each language. For that purpose, we have established a strong multidisciplinary partnership with a team of linguists who are going to help translate each word of the vocabulary into the six languages:
- Wolof: Dr. Mamour Dramé, Teaching Assistant, Department of Linguistics and Language Science, Faculty of Humanities, Cheikh Anta Diop University.
- Poular: Mr. Moctar Baldé, Ph.D. Student in Linguistics and Teaching Assistant, Department of Linguistics and Language Science, Faculty of Humanities, Cheikh Anta Diop University.
- Sérère: Mr. Edouard Diouf, Ph.D. Student in Linguistics and Teaching Assistant, Department of Linguistics and Language Science, Faculty of Humanities, Cheikh Anta Diop University.
- Diola-Fogny: Mr. Pascal Assine, Ph.D. Student in Linguistics and Teaching Assistant, Department of Linguistics and Language Science, Faculty of Humanities, Cheikh Anta Diop University.
- Mandingue: Dr. Mamadou Dabo, Teaching Assistant, Department of Linguistics and Language Science, Faculty of Humanities, Cheikh Anta Diop University.
- Soninké: Mr. Almamy Konaté, Consultant Translator in Soninké.
The dataset is going to be accessible on the GitHub page of the open-source project.
Finally, the third deliverable is to document the whole process and publish it in a technical report.
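To make the scope of the second deliverable concrete, the short Python sketch below enumerates the translations the linguists need to provide: one per English source word and target language. The word and language lists come from this proposal; the function names are hypothetical. The `clips_needed` helper assumes one possible reading of the collection target, namely a fixed number of recordings per word and per language, which is an assumption rather than a commitment of the project.

```python
from itertools import product

# English source vocabulary as listed in this proposal (Warden 2018).
DIGITS = ["Zero", "One", "Two", "Three", "Four",
          "Five", "Six", "Seven", "Eight", "Nine"]
COMMANDS = ["Yes", "No", "Up", "Down", "Left", "Right", "On", "Off",
            "Stop", "Go", "Backward", "Forward",
            "Bed", "Bird", "Cat", "Dog", "Tree"]
VOCABULARY = DIGITS + COMMANDS

# The six target languages of the proposal.
LANGUAGES = ["Wolof", "Poular", "Sérère", "Mandingue", "Diola", "Soninké"]


def translation_prompts(languages=LANGUAGES, vocabulary=VOCABULARY):
    """Return one (language, English word) pair per translation to provide."""
    return list(product(languages, vocabulary))


def clips_needed(recordings_per_word, languages=LANGUAGES, vocabulary=VOCABULARY):
    """Total clips to collect.

    ASSUMPTION: the target is expressed per word and per language.
    """
    return recordings_per_word * len(languages) * len(vocabulary)


prompts = translation_prompts()
print(len(VOCABULARY), len(prompts))  # prints: 27 162
```

Keeping the vocabulary and language lists as plain data makes it straightforward to add a language later: appending one entry to `LANGUAGES` adds 27 prompts to translate.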
Greater narrative and ambition
The full impact of this project can be decomposed into two dimensions: (1) open-source software with a reproducible process for creating datasets specifically in the African context, and (2) a modeling challenge, keyword spotting with African languages, that focuses on a practical machine learning task in a low-resource setting. We argue that keyword spotting with African languages could see widespread adoption within the African machine learning (ML) community and is, therefore, more likely to lead to breakthroughs and original contributions from Africa.
- Low resource: Keyword spotting is a natural task for experimenting with smaller models that require fewer resources to train (Warden 2018). It is relevant for the African ML research community, which may work in limited-resource settings because of the systemic lack of funding in African research laboratories. In addition, the majority of undergraduate students might be under-equipped to run large models. Removing the need for heavy computation and big datasets could therefore unlock the creativity of the African ML community. This trend was acknowledged by the Practical ML for Developing Countries Workshop at ICLR 2020. We think that the current project reinforces it and could lead the African ML community toward academic leadership in resource-constrained ML. Datasets and benchmarks have often played a significant role in driving research progress; MNIST and ImageNet are well-known examples.
- Relevant and practical: We think that working with African languages could see wider adoption by the African ML community, because people have a genuine interest in developing tools for their own language due to the intrinsic social value of languages. Moreover, as Africa has a strong oral tradition (Iwuji, 1989), the general public could greatly benefit from services developed in their native language. This would encourage ML developers toward more innovation and could lead to a virtuous circle of value creation.
- Reproducibility and scalability: (1) Open-source software allows any developer to adapt the software to support local mobile money solutions. (2) Our methodology for selecting languages can then serve as a reference for collecting data for new languages in that developer's country. (3) The GitHub page of the project would then display a link to every new dataset created with the software. (4) Finally, as validation is expected to be performed in a peer-to-peer fashion, the data collection process is expected to scale with the number of speakers willing to be recorded and the number of people willing to validate the words of a given language.
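As an illustration of the peer-to-peer validation mentioned above, the following Python sketch decides the status of a recording from the votes of other contributors. It is a hypothetical rule, not the behavior of Common Voice or of the planned software: the function name, the status labels, and the threshold of two matching votes are all assumptions chosen for the example.

```python
def clip_status(votes, required=2):
    """Decide a clip's status from peer votes, processed in arrival order.

    votes: iterable of booleans, True if a peer judged the recording correct.
    required: ASSUMED number of matching votes needed to settle the clip.
    Returns "valid", "invalid", or "pending".
    """
    yes = no = 0
    for vote in votes:
        if vote:
            yes += 1
        else:
            no += 1
        # The first side to reach the threshold settles the clip.
        if yes >= required:
            return "valid"
        if no >= required:
            return "invalid"
    return "pending"


print(clip_status([True, True]))          # prints: valid
print(clip_status([True, False]))         # prints: pending
print(clip_status([False, True, False]))  # prints: invalid
```

Under a rule of this kind, each clip needs only a small, bounded number of votes, so the validation effort can be shared among however many speakers of a language are willing to take part.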
- Jean Michel Ahmath Sarr – PhD Student in Computer Science, Department of Mathematics and Computer Science, Faculty of Science, Cheikh Anta Diop University. GalsenAI – IndabaX Senegal organizer
- Daouda Tandiang Djiba – Data Scientist – BICIS Group BNP Paribas – GalsenAI – IndabaX Senegal organizer
- Thierno Diop – Lead Data Scientist at Baamtu, cofounder of GalsenAI and Zindi Ambassador in Senegal – IndabaX Senegal organizer
- Derguene Mbaye – Engineering Student in Telecoms & networks at Ecole Superieure Polytechnique de Dakar, GalsenAI – IndabaX Senegal organizer
- Elias Waly Ba – Lead Data Scientist at Air Senegal – GalsenAI – IndabaX Senegal organizer
- Ousseynou Mbaye – PhD Student in Computer Science, Department of Applied Sciences, Communication and Technologies, Alioune Diop University of Bambey – GalsenAI – IndabaX Senegal Organizer
- Dr Mamour Dramé, Teacher