This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.
The program covers a variety of languages and NLP tasks; this particular dataset is a Document Classification dataset for Chichewa.
Language profile: Chichewa
What is Chichewa?
Chichewa is part of the Niger-Congo Bantu group and is one of the most widely spoken indigenous languages of Africa. Chichewa is both an individual dialect and a language group, as we shall discuss in this short article.
The language, Chichewa, also written as Cichewa, or, in Zambia, Cewa, is the native language of the Chewa. ‘Chi’ or ‘ci’ is a Bantu prefix attached to the tribal name to designate the language rather than the geographical region of the tribe. Chewa is the name of a group of people. In some countries, for example Zambia and Mozambique, Chichewa is called Chinyanja. Chinyanja was also the old name for the language in Malawi, before the country became a Republic. During that time, as a British Protectorate, Malawi was called Nyasaland.
Chichewa, with the code ‘ny’, is also one of the 13 African languages with a Google automatic translation. Its inclusion probably reflects the availability of written text in Chichewa compared to other African languages. The code ‘ny’ was most likely chosen because the language was known first as Chinyanja. However, as we will discuss in this article, there are several dialects of Chichewa which differ from each other in noticeable ways. I do not know whether this was taken into account in the text Google used for its machine translation models. But this is a whole new interesting topic in itself!
Who are the Chewa?
The Chewa are a Bantu-speaking people, traditionally described as the descendants of the Maravi, who in the 16th (some say, in the 14th) century migrated to present-day Malawi from the region now called Congo-Kinshasa. Most of what we know about the migrations of the Chewa comes from oral tradition. Samuel Nthara collected some of the oral traditions in his book Mbiri ya Achewa, published in 1944. The name Maravi first appeared in Portuguese documents in 1661.
Nowadays, some of the well-known districts in Malawi where the Chewa live are: Mchinji, Lilongwe, Kasungu, Nkhotakota, Dowa and Dedza. The consensus is that the Chewa of the mainland kept their name as Chewa and lived mainly in the Central Region. The Manganja are the Chewa who settled in the Southern Region. And some Chewa groups who settled at the lake or around the Shire River in the south are called Nyanja. Man’ganja (or Maganja) is southern Chichewa, as opposed to the language spoken in the Central Region (which was also called Western Chichewa / Nyanja). There are phonetic, grammatical and vocabulary differences between these dialects.
Where is Chichewa spoken?
In Malawi, Chichewa is widely understood. It was declared the national language in 1968 and it is viewed as a symbol of national unity by diverse groups. In Mozambique it is spoken especially in the provinces of Tete and Niassa, where it is referred to as Chinyanja. In Zambia, it is spoken in Lusaka and in the Eastern Province (the language is referred to as Nyanja). The language spoken in Lusaka is sometimes called town-Nyanja, as opposed to the Nyanja spoken in rural areas in other parts of Zambia, which is referred to as deep-Nyanja. Nyanja is the language of the Police and the Army. In Zimbabwe, according to some estimates, Chichewa is the third most widely used language after Shona and Ndebele. There is a sizable community of descendants of those who migrated to this area from Nyasaland during colonial times to work in the mines.
Chichewa is also spoken in South Africa, where a significant number of migrants from Malawi work in mining, as domestic workers or in other industries. There are radio services in Chichewa in Malawi, Zambia, South Africa and even in Ethiopia.
How many people speak the language?
According to sources quoted in Wikipedia, there are 12 million native speakers of Chichewa. A similar number is mentioned on the Joshua Project website and includes Chichewa speakers from 8 countries of the world. This number seems to refer to all the people who identify themselves as Chewa, Nyanja and Manganja, as these, according to the Malawi Population Census of 2018, make up about 40% of the population in Malawi. However, in Malawi, the large ethnic groups of Lomwe, Yao and Ngoni have over the course of time adopted Chichewa as their native language.
The number of people who understand and use Chichewa is therefore much higher than the 12 million native speakers. Like Swahili, Chichewa is considered by some a universal language, a common skill enabling people of varying tribes and those living in Malawi, Zambia and Mozambique to communicate without following the strict grammar of specific local languages. In Zambia, many of those whose mother tongue is now Chinyanja have come to consider themselves Ngoni; Nyanja is a lingua franca, being spoken by the police and the administration.
The Need for Datasets in Chichewa
As discussed, seven important facts provide impetus to the initiative to develop a dataset for Chichewa: (1) Chichewa is an important African language, (2) it is representative of the Niger-Congo Bantu group of languages, (3) it is widely spoken, (4) it has a considerable literature, more than other local African languages, (5) there are several methodological studies of its grammar and phonetics, (6) there are several translations from languages such as English and (7) it is spoken by old and young alike.
There has been an interest in developing digital tools for language documentation and natural language processing. Such initiatives have come from researchers involved in linguistics, such as those belonging to linguistics departments at universities in Malawi and Zambia. For example, in Malawi, we found the Chichewa monolingual dictionary corpus, containing about 13,000 nouns, and a short phonetically annotated corpus.
The comparative online Bantu dictionary at Berkeley includes a dataset for Chichewa; however, the project seems to have stalled in 1997. More recently, there has been an interest in creating datasets for use in NLP tools and machine translation and, according to Professor Kishindo, there is a PhD candidate at the University of Malawi interested in working on Machine Translation for Chichewa.
From our investigation, we observe that these datasets or tools tend to be kept in the private domain, are not regularly maintained, or are used only once, and are not well documented. However, their existence is important and it shows that there is a desire and need for such tools.
Chichewa is an important African language. There are differences between the main dialects of Chichewa and the language is undergoing continuous change. Improved methods for discovering online content and digitizing text can open new opportunities for organising Chichewa text into useful corpora. These can then be useful in linguistic work, in building tools for manipulating and comparing text, for finding and visualising connections between texts and for improving machine translation.
Chichewa continues to change as new terms are added to the vocabulary, arising for example from technological needs. Its use by the younger generation creates new idioms and meanings, and creative expression through poetry and literature finds venues online. Looking at language in novel ways using technology can also help engage the new generation in how they use, view and develop their language.
In this short article, we looked at the use of Chichewa and why we think it is important to build datasets for this language. We hope that this will be motivating and inspiring to others who are interested in this language or other African languages. This article was written as the author embarked on an AI4D Language Dataset Fellowship for putting together a Chichewa dataset. This is a small but important initiative aimed at engaging with the Machine Learning generation on the African continent. I am honoured to play a small part in the building of such datasets.
Researcher Profile: Amelia Taylor
Amelia graduated with a PhD in Mathematical Logic from Heriot-Watt University in 2006, where she was part of the ULTRA group. After that she worked as a research assistant on a project with Heriot-Watt University and the Royal Observatory in Edinburgh, aiming at developing an intelligent query language for astronomical data. From 2006 to 2013, Amelia also worked in finance in the City of London and Edinburgh, where she built risk models for asset allocation and liability-driven investments.
For the last 5 years, Amelia has been teaching programming and AI courses at the University of Malawi in the CIT and engineering department. Amelia also teaches research methodology and supervises MSc and PhD students. While her first interest in AI as an undergraduate was in the field of Natural Language Processing and intelligent query systems, she is also interested in other uses of technology and AI for solving real-world problems.