Description

Namibia is home to 2.5 million people with a rich cultural and colonial history spanning over 100 years.

The cultural practices, knowledge, and history of the Namibian people have not been told from the perspectives of the Namibian people themselves. As Göring said at the Nuremberg trials, “The victor will always be the judge, and the vanquished the accused.”

As such, this project aims to capture this knowledge in its historical and cultural context for one of the most critically endangered languages, Khoekhoegowab, and for Namibia's most widely spoken language, Oshiwambo, and in doing so provide data for NLP tasks.

This project builds on prior efforts to create cultural and historical texts in the Khoekhoegowab language. It will crowdsource a speech dataset from 300 of a potential 10,000 Namibian war veterans, most of them Oshiwambo speakers, and from a community of Khoekhoegowab elders whose traditional monitoring and tracking methods are still used in wildlife conservation.

The project will consider various data gathering methods, such as interviews, focus groups, and web apps, to capture the data. The speech data will be annotated and translated into English.

Introduction

When it comes to scientific communication and education, language matters. Discussing science in local indigenous languages not only reaches more people who do not speak English or French as a first language, it also integrates the facts and methods of science into cultures that have been denied it in the past. As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of globalization.”

During the COVID-19 pandemic, many African governments did not communicate about COVID-19 in the most widespread languages in their countries. ∀ et al. (2020) demonstrated that machine translation tools failed to translate COVID-19 surveys, since the only data available to train the models was religious data. Furthermore, they noted that scientific terms did not exist in the respective African languages.

Thus, we propose to build a multilingual parallel corpus of African research, by translating African preprint research papers released on AfricArxiv into 6 diverse African languages.

Proposed Dataset and Use Cases

When it comes to scientific communication, language matters. Jantjies (2016) demonstrates how language matters in STEM education: students perform better when taught mathematics in their home language. Language also matters in how scientific communication can dehumanise the people it chooses to study. Robyn Humphreys, at the #LanguageMatters seminar during UCT Heritage 2020, noted the following: “During the continent’s colonial past, language – including scientific language – was used to control and subjugate and justify marginalisation and invasive research practices”.

Discussing science in local indigenous languages not only reaches more people who do not speak English as a first language, it also integrates the facts and methods of science into cultures that have been denied it in the past.

As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of
globalization.” (Prah, Kwesi Kwaa, 2007). When science becomes “foreign” or something non-African, when one has to assume another identity just to theorize and practice science, it’s a subjugation of the mind – mental colonization.

There is a substantial amount of distrust in science, in particular by many black South Africans who can cite many examples of how it has been abused for oppression in the past. In addition, the communication and education of science was weaponized by the oppressive apartheid government in South Africa, and that has left many seeds of distrust in citizens who only experience science being discussed in English.

Through government-funded efforts, European-derived languages such as Afrikaans, English, French, and Portuguese have been used as vessels of science, but African indigenous languages have not been given the same treatment. Modern digital tools like machine learning offer new, low-cost opportunities for scientific terms and ideas to be communicated in African indigenous languages.
During the COVID-19 pandemic, many African governments did not communicate about COVID-19 in the most widespread languages in their countries. ∀ et al. (2020) demonstrated the difficulty of translating COVID-19 surveys, since the only data available to train the models was religious data. Furthermore, they noted that scientific terms did not exist in the respective African languages.

Use cases:

  • A machine translation tool for AfricArxiv to aid translation of their research to and from African languages
  • Terminology developed will be submitted to respective boards for addition to official language glossaries for further improvements to scientific communication
  • A machine translation tool for African universities to ensure accessibility of their publications
  • A machine translation tool for scientific journalists to assist in widely distributing their work on the African continent
  • A machine translation tool to aid translation of impactful STEM university curricula into African languages

Personnel

Jade Abbott is the co-founder of Masakhane and a Staff Engineer at Retro Rabbit South Africa, working primarily in NLP, with an MSc in Computer Science from the University of Pretoria. She is a thought leader in the space of NLP in production and African NLP (especially machine translation), and has published and spoken at numerous conferences across the world, including the Deep Learning Indaba, ICLR 2020, and the UN World Data Forum. In 2019, she co-founded Masakhane, an initiative to spur NLP research in Africa that she now leads; its members have collectively published over 15 works in the past year and are leading the conversation around geographic and language diversity in NLP in Africa.

Dr. Johanna Havemann is a trainer and consultant in [Open] Science Communication and [digital] Science Project Management, and is part of the AfricArxiv team. Her work experience covers NGOs, a science startup, and international institutions including the UN Environment Programme. With a focus on digital tools for science and her label Access 2 Perspectives, she aims to strengthen global science communication in general, and with a regional focus on Africa, through Open Science. For the past two years, she has placed an additional focus on language diversity in science; the pan-African Open Access portal she coordinates provides information and accepts submissions in 12 official African languages.

Sibusiso Biyela has been a science communicator at ScienceLink since 2016, where he has worked with South African universities and international research institutions to produce science communication content for many audiences that include policymakers, the research
community, and the lay public. He has experience as a thought leader on the decolonisation of science and science communication. He has given talks on the topic at international conferences, contributing to discussions on platforms such as national radio and international
podcasts. He is the author of the widely regarded article “Decolonizing Science Writing in South Africa”, in which he has been vocal about creating scientific terms in the isiZulu language.

Introduction

Kenyan author Ngugi wa Thiong’o, in his book Decolonising the Mind, states: “The effect of a cultural bomb is to annihilate a people’s belief in their names, in their languages, in their environment, in their heritage of struggle, in their unity, in their capacities and ultimately in themselves.” When a technology treats something as simple and fundamental as your name as an error, it in turn robs you of your personhood and reinforces the colonial narrative that you are other.

Named entity recognition (NER) is a core NLP task in information extraction, and NER systems are a requirement for numerous products, from spell-checkers to localized voice and dialogue systems and conversational agents, that need to identify African names, places, and people for information retrieval.

Currently, the majority of existing NER datasets for African languages come from WikiNER, which is automatically annotated and very noisy, since the text quality for African languages is not verified. Only a few African languages have human-annotated NER datasets. To our knowledge, the only open-source part-of-speech (POS) datasets that exist cover a small subset of South African languages, along with Yoruba, Naija, Wolof, and Bambara (Universal Dependencies).

Pre-trained language models such as BERT and XLM-RoBERTa are producing state-of-the-art NLP results that would undoubtedly benefit African NLP. Beyond its direct uses, NER is also a popular benchmark for evaluating such language models. For the above reasons, we have chosen to develop a wide-coverage POS and NER corpus for 20 African languages based on news data.
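
To make the annotation target concrete, here is a minimal sketch of the BIO-tagged, CoNLL-style format commonly used for human-annotated NER corpora, together with a small Python reader; the example tokens and tags are purely illustrative and are not drawn from the proposed corpus.

# Minimal reader for CoNLL-style NER annotations: one "token<TAB>tag" pair per line,
# with blank lines separating sentences. The sample text is illustrative only.
SAMPLE = """Adesina\tB-PER
visited\tO
Kampala\tB-LOC
.\tO

Makerere\tB-ORG
University\tI-ORG"""

def read_conll(lines):
    """Yield sentences as lists of (token, tag) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split("\t")
        sentence.append((token, tag))
    if sentence:
        yield sentence

for sent in read_conll(SAMPLE.splitlines()):
    print(sent)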

Personnel

Peter Nabende is a Lecturer at the Department of Information Systems, School of
Computing and Informatics Technology, College of Computing and Information Sciences, Makerere University. He has a PhD in Computational Linguistics from the University of Groningen, The Netherlands. He has conducted research on named entities across several writing systems and languages in the NLP subtasks of transliteration detection and generation. He has also conducted experimental research on the main NLP task of machine translation between three low-resourced indigenous Ugandan languages (Luganda, Acholi, and Lumasaaba) and English, using statistical and neural machine translation methods and tools such as Moses and OpenNMT-py. He has supervised the creation of language technology resources involving another three Ugandan languages (a Lusoga-English parallel corpus and Grammatical Framework (GF)-based computational grammar resources for Runyankore-Rukiga and Runyoro-Rutooro).

Jonathan Mukiibi is a Masters student in Computer Science at Makerere University. His current research focuses on topic classification of speech documents for crop disease surveillance using Luganda-language radio data. He is the coordinator of natural language processing tasks at the Artificial Intelligence Lab, Department of Computer Science, Makerere University.

David Ifeoluwa Adelani (an NLP Researcher, https://dadelani.github.io/) is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialogue systems and online social interactions. He is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with a special focus on African languages. He was involved in the creation of the first NER dataset for Hausa [Hedderich et al., 2020] and Yoruba [Alabi et al., 2020] in the news domain.

Daniel D’souza has an MS in Computer Science (Specialization in Natural Language Processing) from the University of Michigan, Ann Arbor. He currently works as a Data Scientist at ProQuest LLC.

Jade Abbott has an MSc in Computer Science from the University of Pretoria. She is a Machine Learning lead at Retro Rabbit South Africa, working primarily in NLP. Additionally, she co-founded Masakhane, an initiative to spur NLP research in Africa, and has published widely on African NLP tasks.

Olajide Ishola has an MA in Computational Linguistics. He is one of the pioneers of the first dependency treebank for the Yoruba language [Ishola et al., 2020]. His interest lies in corpus development and NLP for indigenous Nigerian languages.

Constantine Lignos is an Assistant Professor in the Department of Computer Science at Brandeis University where he directs the Broadening Linguistic Technologies lab. He received his PhD from the University of Pennsylvania in 2013. His research focus is the construction of human language technology for previously-underserved languages. He has worked on named entity annotation and system creation for Tigrinya and Oromo, and additionally developed entity recognition systems for Amharic, Hausa, Somali, Swahili, and Yoruba. He has also worked on natural language processing tasks for other African languages, including cross-language information retrieval for Somali and information extraction for Nigerian English.

The objective of this project is to build a Wolof text-to-speech system. Three people will be involved: Thierno Ibrahima DIOP, senior data scientist at Baamtu SARL; El Hadj Mamadou Nguer, Assistant Professor at Universite Virtuelle du Senegal; and Sileye BA, senior machine learning researcher at the L’Oreal Innovation Center in Paris. Thierno Ibrahima DIOP and Mamadou Nguer will be the project’s principal investigators.

The project will exploit a dataset of 40,000 Wolof phrases uttered by two actors. This open-source dataset is a deliverable of a previous project.

The project will be conducted in four phases:
1. Evaluation of the quality of the dataset
2. Implementation of a machine learning model mapping Wolof texts to their corresponding utterances
3. Quantitative and qualitative evaluation of the implemented model's performance
4. Development of an API exposing the implemented text-to-speech model

Dataset quality will be assessed on a randomly sampled portion of about a thousand uttered phrases. These phrases will be qualitatively validated for comprehensibility by fluent Wolof speakers.
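
As a sketch of this phase 1 quality check, and assuming the recordings are listed in a metadata file with one "utterance_id|transcription" line per clip (the file name and delimiter are assumptions, not a confirmed deliverable format), the snippet below draws a reproducible random sample of about a thousand phrases for manual validation by fluent Wolof speakers.

import csv
import random

METADATA = "wolof_tts/metadata.csv"   # assumed location and "id|transcription" format
SAMPLE_SIZE = 1000

with open(METADATA, encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("|", maxsplit=1) for line in f if line.strip()]

random.seed(42)   # fixed seed so the validation sample can be reproduced
sample = random.sample(rows, min(SAMPLE_SIZE, len(rows)))

with open("validation_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["utterance_id", "transcription", "is_comprehensible", "comments"])
    for utterance_id, transcription in sample:
        writer.writerow([utterance_id, transcription, "", ""])  # reviewers fill in the last two columns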

A state-of-the-art neural speech synthesizer will be implemented and evaluated on the dataset. Neural network models have been selected because they can be trained end to end without requiring the phoneme-level word segmentation needed by competing statistical models. We will investigate text-to-spectrogram models such as Tacotron, Glow-TTS, and Speedy-Speech, as well as vocoder models such as MelGAN.

The trained model will be evaluated quantitatively and qualitatively. The quantitative evaluation will be done using metrics provided in standard text-to-speech evaluation libraries. The qualitative evaluation will be based on fluent Wolof speakers’ comprehension of synthesized Wolof utterances.

The model will be exposed via an API that takes as input a language token and input text, and returns the synthesized text as an audio file. This API will be plugged into a web platform based on the Masakhane MT web platform.
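
A minimal sketch of such an API using FastAPI is shown below; the endpoint name, request fields, and the synthesize function (a stand-in for the trained Wolof model) are assumptions for illustration rather than a fixed design.

# Sketch: POST a language token and text, receive a WAV file in return.
from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse
from pydantic import BaseModel

app = FastAPI(title="Wolof TTS API (sketch)")

class TTSRequest(BaseModel):
    language: str  # language token, e.g. "wo" for Wolof
    text: str      # text to synthesize

def synthesize(text: str, language: str) -> str:
    """Hypothetical wrapper around the trained model; returns the path of the generated WAV file."""
    raise NotImplementedError("plug in the trained Wolof text-to-speech model here")

@app.post("/synthesize")
def synthesize_endpoint(request: TTSRequest):
    if request.language != "wo":
        raise HTTPException(status_code=400, detail="unsupported language token")
    wav_path = synthesize(request.text, request.language)
    return FileResponse(wav_path, media_type="audio/wav", filename="synthesis.wav")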

For deployment, a Kubernetes cluster will be used to allow horizontal scaling: we can start with a single instance, and the number of instances will be adjusted automatically depending on the load. The cost of an instance (8 cores, 32 GB of RAM) will be about $83.95 per month on a yearly reservation basis.

An objective of this project is to publish the work on the dataset and the developed speech synthesis model at a natural language processing venue such as the African NLP Workshop or the Deep Learning Indaba. This will give the work more visibility and, at the same time, advance machine-learning-based African language processing.

 

Introduction

Wildlife tourism is a significant and growing contributor to the economic and social development in the African region through revenue generation, infrastructure development and job creation. According to a recent press release by the World Travel and Tourism Council [1], travel and tourism contributed $194.2 billion (8.5% of GDP) to the African region in 2018 and supported 24.3 million jobs (6.7% of total employment). Globally, travel and tourism is a $7.6 trillion industry, and is responsible for an estimated 292 million jobs [2]. Tourism is also one of the few sectors in which female labor participation is already above parity, with women accounting for up to 70% of the workforce [2].

However, the wildlife tourism industry in Africa is increasingly threatened by rising human population and wildlife crime. As poaching becomes more organised and livestock incursions become frequent occurrences, shortages in ranger workforce and shortcomings in technological developments in this space have put thousands of species at risk of endangerment, and threaten to collapse the wildlife tourism industry and ecosystem.

Tourism in Kenya generated revenue of $1.5 billion in 2018 [3], and The National Wildlife Conservation Status Report, 2015 – 2017 [4], presented by the Ministry of Tourism and Wildlife of Kenya, reported that wildlife conservancies in Kenya supported over 700,000 community livelihoods. A recession of the wildlife tourism industry could therefore have major adverse economic and social impacts on the country. It is thus critical that sustainable solutions are reached to save the wildlife tourism industry, and that further research is fuelled in this area.

Problem definition

According to The National Wildlife Conservation Status Report, 2015 – 2017 [4], presented by the Ministry of Tourism and Wildlife of Kenya, there is currently a shortage of 1,038 rangers out of the 2,484 rangers required in Kenyan national parks and reserves, a deficit of over 40%. To address shortages in the ranger workforce, carry out monitoring activities more effectively, and detect criminal or endangering activities with greater precision, we propose the deployment of Unmanned Ground Vehicles (UGVs) for intelligent patrol and wildlife monitoring across the national parks and reserves in Kenya.

The UGVs would be fitted with a suite of cameras and sensors that would enable them to navigate autonomously within the parks and run multiple deep learning and computer vision algorithms that carry out numerous monitoring activities, such as detection of poaching, livestock incursions, human-wildlife conflict, and distressed wildlife, as well as species identification.

The UGVs could be monitored from a central surveillance system, where alerts can be generated on detection of any alarming activity, and rangers dispatched to respond. Ethical considerations can be made to facilitate the deployment of these UGVs in a manner that aids the ranger workforce in their routine surveillance tasks throughout the national parks and reserves that often span thousands of square kilometers, rather than replace them. Sustainable and ethical automation could help create more jobs in the automotive and technology sectors without replacing current jobs.

The deployment of a project of this scale, however, would require significant investments in building the UGV, and require feasibility studies from the government and international wildlife conservation bodies. Furthermore, without reasonable computer vision and autonomous navigation accuracies, investments towards building the unmanned vehicle would be futile. It is thus crucial that efforts are first made towards solving the computer vision and autonomous navigation challenges posed by the rough terrains prevalent in national parks and reserves.

This project therefore serves as a stepping-stone towards adopting autonomous vehicle technology in Africa and pioneering further research in the field and its applications to broader areas beyond just transportation. Additionally, its adaptation in national park environments would allow it to be tested in unstructured environments lacking road infrastructure and free of traffic and pedestrians, thus allowing the systems to be tested safely and get quicker policy approvals. The scope of this research is hence limited to developing an end-to-end deep learning model that can autonomously navigate a vehicle over dirt roads and challenging terrain that is present in national parks and reserves.

The model will be trained on trail video as well as driving data such as steering wheel angle, speed, acceleration, and Inertial Measurement Unit (IMU) data. The accuracy of the model will be measured by calculating the error rate between the model's predictions and the driver's actual inputs over a given distance. We also aim to publish the dataset of annotated driving data from national parks and reserves, the first of its kind, to encourage further research in this space. Additionally, we shall collect metadata such as the number of patrol vehicles per square kilometer, the average distance travelled per vehicle per day, and the distance of traversable road in the park per square kilometer, which can be used for a preliminary analysis of the feasibility of applying the project results to automated wildlife patrol.
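
As an illustration of the proposed accuracy measure, the snippet below computes simple error rates (mean absolute error and root mean squared error) between the model's steering predictions and the driver's actual inputs over a logged drive; the file name and column names are assumptions about the eventual dataset, not a published schema.

import numpy as np
import pandas as pd

# Assumed log format: one row per timestep with the driver's input and the model's prediction.
log = pd.read_csv("patrol_drive_log.csv")  # hypothetical file
actual = log["steering_angle_deg"].to_numpy()               # driver's actual steering input
predicted = log["predicted_steering_angle_deg"].to_numpy()  # model output for the same timestep

mae = np.mean(np.abs(predicted - actual))           # mean absolute error, in degrees
rmse = np.sqrt(np.mean((predicted - actual) ** 2))  # penalizes large deviations more heavily

print(f"MAE over the drive:  {mae:.2f} degrees")
print(f"RMSE over the drive: {rmse:.2f} degrees")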

References

[1] “African tourism sector booming – second-fastest growth rate in the world”, WTTC press release, Mar. 13, 2019. Accessed on Jul. 11, 2019. [Online]. Available:
https://www.wttc.org/about/media-centre/press-releases/press-releases/2019/african-tourism-sector-booming-second-fastest-growth-rate-in-the-world/
[2] “Supporting Sustainable Livelihoods through Wildlife Tourism”, World Bank Group, 2018.
[3] “Tourism Sector Performance Report – 2018”, Hon. Najib Balala, 2018.
[4] “The National Wildlife Conservation Status Report, 2015 – 2017”, pp. 131, 74, 75, Ministry of Tourism and Wildlife, Kenya, 2017.

 

Abstract

According to the Open Data Barometer by the World Wide Web Foundation, countries in sub-Saharan Africa rank poorly on open data initiatives, with an average score of about 20 out of a maximum of 100 based on readiness, implementation, and impact [1]. To make the process of creating, introducing, and passing parliamentary bills a force for public accountability, the information needs to be easier for the average citizen to analyze and process.

This is not the case for most of the bills introduced and passed by parliaments in Sub-Saharan Africa. In this work, we present a method to overcome the implementation barrier. For the Nigerian parliament, we used a pre-trained optical character recognition (OCR) tool, natural language processing techniques, and machine learning algorithms to categorize parliamentary bills. We propose to improve the work on Nigerian parliamentary bills by using text detection models to build a custom OCR tool. We also propose to extend our method to three other African countries: South Africa, Kenya, and Ghana.

Introduction

Given the challenges and precariousness facing developing and underdeveloped countries, the quality of policymaking and legislation is of enormous importance. Legislation can be used to advance several of the United Nations Sustainable Development Goals (SDGs), such as poverty alleviation, a good public health system, quality education, economic growth, and sustainability. Targets 16.6 and 16.7 of the UN SDGs are to “develop effective, accountable, and transparent institutions at all levels” and to “ensure responsive, inclusive, participatory and representative decision making at all levels” [2]. For countries in Sub-Saharan Africa to meet these targets, an open data revolution needs to happen at all levels of government and, more importantly, at the parliamentary level.

Objectives and Expectations

To achieve the goal of meeting UN SDG targets 16.6 and 16.7, making effective use of data is key. However, does such data currently exist? If so, how should it be organized in a framework that is amenable to the decision-making process? Here, we propose expanding our work on categorizing parliamentary bills in Nigeria, which uses Optical Character Recognition (OCR), document embeddings, and recurrent neural networks, to three other countries in Africa: Kenya, Ghana, and South Africa.

We also plan to improve our text extraction process by training a custom OCR model. The objective of this project is to generate semantic and structured data from the bills and, in turn, categorize them into socio-economic labels. We plan to recruit three interns to work on this project for five months: two machine learning interns and one software engineering intern.
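
A minimal sketch of the current pipeline, under stated assumptions: scanned bill PDFs are converted to images, text is extracted with a pre-trained OCR engine (Tesseract via pytesseract), and a simple TF-IDF classifier assigns socio-economic labels. The file paths and label names are illustrative, and the linear classifier stands in for the document-embedding and recurrent models mentioned above.

# Sketch: pre-trained OCR + text classification for parliamentary bills.
# Assumes pdf2image, pytesseract (with Tesseract installed) and scikit-learn are available.
from pdf2image import convert_from_path
import pytesseract
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_text(pdf_path: str) -> str:
    """OCR every page of a scanned bill and concatenate the recognized text."""
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

# Hypothetical training set: OCR'd bill texts and their socio-economic labels.
train_paths = ["bills/hb_001.pdf", "bills/hb_002.pdf"]  # illustrative paths
train_labels = ["health", "education"]                  # illustrative labels
train_texts = [extract_text(path) for path in train_paths]

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_texts, train_labels)
print(classifier.predict([extract_text("bills/hb_new.pdf")]))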

Conclusion and Long Term Vision

Our initial experimental results show that our model is effective for categorizing the bills, which will aid our large-scale digitization efforts. However, we identified a key remaining challenge based on our results: the output from the pre-trained OCR tool is generally not a very accurate representation of the text in the bills, especially for low-quality PDFs. A promising possibility is to solve this by training the custom OCR that we propose. The rapid acceleration of text detection research driven by novel deep learning methods can help us in this area.

Methods such as region-based or single-shot detectors can be employed. In addition, we plan to use image augmentation to alter the size, background noise, or color of the bills. A large-scale annotation effort on the texts can provide the labels we need to train our custom OCR for text identification and named entity recognition. We are also extending our methodology to other countries in Sub-Saharan Africa. Results that lead to accurate categorization of parliamentary bills are well positioned to have a substantial impact on governmental policies and on the quest for governments in low-resource countries to meet the open data charter principles and the United Nations' Sustainable Development Goals on open government.

It can also empower policymakers, stakeholders, and governmental institutions to identify and monitor bills introduced to the National Assembly for research purposes, and facilitate the efficiency of bill creation and open data initiatives. We plan to design an intercontinental tool that combines information from all bills and categories and makes them easily accessible to everyone. For our long-term vision, we plan to analyze documents on parliamentary votes and proceedings to gain more insight into legislative debates and patterns.

Description

Algorithms for text classification still face some open problems, for example dealing with long pieces of text and with texts in under-resourced languages.

This challenge gives participants the opportunity to improve on text classification techniques and algorithms for text in Chichewa. The texts, which are news articles, are of varying length; some are quite long and will pose challenges in chunking and classification.

The objective of this challenge is to classify these news articles by topic.

We hope that your solutions will illustrate some challenges and offer solutions.

Chichewa is a Bantu language spoken in much of Southern, Southeast and East Africa, namely the countries of Malawi and Zambia, where it is an official language, and Mozambique and Zimbabwe where it is a recognised minority language.

tNyasa Ltd Data Science Lab

We are a company based in Malawi offering intelligent technological solutions for the travel, technology, trade, cultural, and education sectors in Malawi. As part of our Data Science Lab, we work on language tools for Chichewa such as the construction and curation of datasets, speech-to-text, and information processing.

AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It is aimed at strengthening and developing community, scientific and technological excellence in a range of AI-related areas. It is composed of African Artificial Intelligence researchers, practitioners and policymakers.

Datasets

The data was collected from news publications in Malawi. The tNyasa Ltd Data Science Lab used three main sources: the Nation Online newspaper, Radio Maria, and the Malawi Broadcasting Corporation. The articles presented in the dataset are full articles and span many different genres, from social issues, family, and relationships to political and economic issues.

The articles were cleaned by removing special characters and html tags.

Your task is to classify the news articles into one of 19 classes. The classes are mutually exclusive.

List of classes: [‘SOCIAL ISSUES’, ‘EDUCATION’, ‘RELATIONSHIPS’, ‘ECONOMY’, ‘RELIGION’, ‘POLITICS’, ‘LAW/ORDER’, ‘SOCIAL’, ‘HEALTH’, ‘ARTS AND CRAFTS’, ‘FARMING’, ‘CULTURE’, ‘FLOODING’, ‘WITCHCRAFT’, ‘MUSIC’, ‘TRANSPORT’, ‘WILDLIFE/ENVIRONMENT’, ‘LOCALCHIEFS’, ‘SPORTS’, ‘OPINION/ESSAY’]

Files available for download:

  • Train.csv – contains the target. This is the dataset that you will use to train your model.
  • Test.csv – resembles Train.csv but without the target-related columns. This is the dataset on which you will apply your model.
  • SampleSubmission.csv – shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv. The order of the rows does not matter, but the names of the IDs must be correct.
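
As a starting point, the sketch below trains a simple TF-IDF plus logistic-regression baseline on Train.csv and writes a submission in the format described above; the column names ("ID", "Text", "Label") are assumptions about the CSV layout and should be checked against the actual files.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed column names: "ID", "Text" and (in Train.csv only) "Label".
train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # character n-grams cope well with long, morphologically rich text
    LogisticRegression(max_iter=1000),
)
model.fit(train["Text"], train["Label"])

submission = pd.DataFrame({"ID": test["ID"], "Label": model.predict(test["Text"])})
submission.to_csv("submission.csv", index=False)  # should mirror the SampleSubmission.csv layout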

Partners

AI4D-Africa; Artificial Intelligence for Development-Africa Network

 

Description

Ewe and Fongbe are Niger–Congo languages, part of a cluster of related languages commonly called Gbe. Fongbe is the major Gbe language of Benin (with approximately 4.1 million speakers), while Ewe is spoken in Togo and southeastern Ghana by approximately 4.5 million people as a first language and by a million others as a second language. They are closely related tonal languages, and both contain diacritics that can make them difficult to study, understand, and translate.

Although those languages are at the core of the economic and social life of at least 3 major West African capital cities (namely Cotonou, Lome and Accra), they are today mostly spoken and very rarely written. Due to that fact (among other reasons), there is very little official or formal communication in those languages, leaving non-French/English speakers often unable to access critical facilities like education, banking, and healthcare. This challenge is part of an initiative that wishes to bring down the barriers between African local language speakers and modern society.

The objective of this challenge is to create a machine translation system capable of converting text from French into Fongbe or Ewe. You may train one model per language or create a single model for both. You may not use any external data, so a key component of this competition is finding a way to work with the available data efficiently.

This is a pioneering competition as far as low-resourced West African languages are concerned. A good solution would be a model that can be improved upon or used by researchers across the world to create APIs that can be integrated into day-to-day tools like ATMs, delivery applications, etc., and help bridge the gap between rural West Africa and modern services.

This competition is one of five NLP challenges we will be hosting on Zindi as part of AI4D’s ongoing African language NLP project, and is a continuation of the African language dataset challenges we hosted earlier this year. You can read more about the work here.

About Takwimu Lab (takwimulab.gitlab.io)

Takwimu Lab is an association of francophone West African professionals and enthusiasts of AI technologies. Our goal is to spread awareness about the challenges AI can help solve in our communities, disseminate knowledge, and build solutions that address real issues in our countries. Takwimu Lab is based in Benin.

Data

This is a parallel corpus dataset for machine translation from French to Ewe and French to Fongbe, languages spoken in Togo and Benin respectively. It contains roughly 23,000 French-to-Ewe and 53,000 French-to-Fongbe parallel sentences, collected from blogs, tales, newspapers, daily conversations, and webpages, and annotated for neural machine translation. The collected sentences were preprocessed and aligned manually.

Variable definitions

  • ID : Unique identifier of the text
  • French : Text in French
  • Target_Language : The target language (Ewe or Fongbe)
  • Target : Text in Fongbe or Ewe

Files available for download:

  • Train.csv – contains parallel sentences for training your model or models. There are 77,177 rows, of which 53,366 are French-Fongbe and 23,811 are French-Ewe
  • Test.csv – resembles Train.csv but without the Target column. This is the dataset on which you will apply your model(s).
  • SampleSubmission.csv – shows the submission format for this competition, with the ID column mirroring that of Test.csv and the ‘Target’ column containing your translation into Ewe or Fongbe. The order of the rows does not matter, but the IDs must be correct.
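
Because Train.csv mixes both language pairs, a natural first step is to split it by target language before training one model per language; the sketch below assumes the column names listed above (ID, French, Target_Language, Target) and that the Target_Language values are "Ewe" and "Fongbe", which should be verified against the data.

import pandas as pd

train = pd.read_csv("Train.csv")

# Expected: 77,177 rows in total, 53,366 French-Fongbe and 23,811 French-Ewe.
print(train["Target_Language"].value_counts())

for language, subset in train.groupby("Target_Language"):
    # Keep only the source/target text columns for the MT toolkit of choice.
    subset[["French", "Target"]].to_csv(f"train_fr_{language.lower()}.csv", index=False)
    print(f"{language}: {len(subset)} parallel sentences")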

Partners

AI4D-Africa; Artificial Intelligence for Development-Africa Network

Description

On social media, Arabic speakers tend to express themselves in their own local dialect. To do so, Tunisians use ‘Tunisian Arabizi’, where the Latin alphabet is supplemented with numbers. However, annotated datasets for Arabizi are limited; in fact, this challenge uses the only known Tunisian Arabizi dataset in existence.

Sentiment analysis relies on multiple word senses and cultural knowledge, and can be influenced by age, gender and socio-economic status.

For this task, we have collected and annotated sentences from different social media platforms. The objective of this challenge is, given a sentence, to classify whether it expresses positive, negative, or neutral sentiment, as an average user would judge it. For messages conveying both a positive and a negative sentiment, whichever is the stronger sentiment should be chosen.

Such solutions could be used by banks, insurance companies, or social media influencers to better understand and interpret a product’s audience and their reactions.

This competition is one of five NLP challenges we will be hosting on Zindi as part of AI4D’s ongoing African language NLP project, and is a continuation of the African language dataset challenges we hosted earlier this year. You can read more about the work here.

About iCompass

iCompass is a Tunisian startup created in July 2019 and labelled under the Startup Act in August 2019. iCompass specializes in Artificial Intelligence, and more particularly in Natural Language Processing. The particularity of iCompass is breaking the language barrier by developing systems that understand and interpret local dialects, especially African and Arab ones.

Partners

AI4D-Africa; Artificial Intelligence for Development-Africa Network

Description

Machine translation (MT) is a popular Natural Language Processing (NLP) task which involves the automatic translation of sentences from a source language to a target language. Machine translation models are very sensitive to the domain they were trained on, which limits their generalization to other domains of interest such as the legal or medical domains. The problem is more severe in low-resource languages like Yorùbá, where the datasets most readily available for training are in the religious domain, like JW300.

How can we train MT models to generalize to multiple domains or quickly adapt to new domains of interest? In this challenge, you are provided with 10,000 Yorùbá-to-English parallel sentences sourced from multiple domains, such as news articles, TED talks, movie transcripts, radio transcripts, software localization texts, and other short articles curated from the web. Your task is to train a multi-domain MT model that will perform very well for practical use cases.

The goal of this challenge is to build a machine translation model to translate sentences from Yorùbá language to English language in several domains like news articles, daily conversations, spoken dialog transcripts and books. Your solution will be judged by how well your translation prediction is semantically similar to the reference translation.
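
As one possible starting point rather than a prescribed approach, the sketch below runs a publicly available pretrained Yorùbá-to-English model from the Hugging Face Hub as a zero-shot baseline before any domain adaptation; the model name and the sample sentence are assumptions, and the competition rules on external models and data should be checked first.

# Zero-shot baseline with a pretrained Marian MT model from the Hugging Face Hub.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-yo-en"  # assumed pretrained Yorùbá-to-English model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = ["Ẹ káàárọ̀, báwo ni?"]  # illustrative Yorùbá input
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
generated = model.generate(**batch, max_length=128)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))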

The translation models developed will assist human translators in their jobs, help English speakers to have better communication with native speakers of Yorùbá, and improve the automatic translation of Yorùbá web pages to English language.

This competition is one of five NLP challenges we will be hosting on Zindi as part of AI4D’s ongoing African language NLP project, and is a continuation of the African language dataset challenges we hosted earlier this year. You can read more about the work here.

About Masakhane

Masakhane is an open, participatory, grassroots NLP research initiative for Africans, by Africans, with the aim of putting African NLP research on the map by holistically tackling the problems facing society. Founded in 2019, Masakhane has since garnered over 400 researchers from over 30 African countries, published state-of-the-art research for over 38 African languages at various venues, and built a thriving community. Masakhane’s participatory approach has enabled researchers without formal scientific training to contribute data, evaluations, and models to published research, by focusing on lowering the barriers to entry.

Partner

AI4D-Africa; Artificial Intelligence for Development-Africa Network