Algorithms for text classification still face open problems, for example handling long texts and texts in under-resourced languages.
This challenge gives participants the opportunity to improve text classification techniques and algorithms for text in Chichewa. The texts are news articles of varying length; some are quite long and will pose challenges for chunking and classification.
The objective of this challenge is to classify news articles by topic.
We hope that your solutions will illustrate some of these challenges and offer ways to address them.
Algorithms for text classification have come a long way, but classifying long texts and working with under-resourced languages can still pose difficulties. This challenge gives participants the opportunity to improve on text classification techniques and algorithms for text in Chichewa. The texts are news articles of varying lengths. The objective of this challenge is to classify these articles by topic. We hope that your solutions will illustrate some of these challenges and offer ways to address them.
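One common way to handle long articles is to split each one into overlapping word windows, classify every window, and combine the window-level predictions into a single article-level label. The sketch below illustrates that idea using only the Python standard library. The function names, the window size of 200 words, the overlap of 50 words and the majority-vote rule are all illustrative assumptions, not part of the challenge specification.

```python
from collections import Counter


def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping windows of at most max_words words.

    max_words and overlap are illustrative defaults, not challenge settings.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def majority_label(chunk_labels):
    """Combine per-chunk predictions into one label by majority vote."""
    if not chunk_labels:
        raise ValueError("no chunk labels to combine")
    return Counter(chunk_labels).most_common(1)[0][0]
```

In use, each chunk returned by `chunk_words` would be passed to whatever classifier you train, and `majority_label` would turn the per-chunk outputs into the single class required for each article. Averaging class probabilities across chunks is an equally reasonable alternative to voting.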
Chichewa is a Bantu language spoken in much of Southern, Southeast and East Africa, namely in Malawi and Zambia, where it is an official language, and in Mozambique and Zimbabwe, where it is a recognised minority language.
tNyasa Ltd Data Science Lab
We are a company based in Malawi offering intelligent technological solutions for the travel, technology, trade, cultural and education sectors in Malawi. As part of the Data Science Lab, we work on language tools for Chichewa, such as the construction and curation of data sets, speech-to-text and information processing.
AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It is aimed at strengthening and developing community, scientific and technological excellence in a range of AI-related areas. It is composed of African Artificial Intelligence researchers, practitioners and policymakers.
The data was collected from news publications in Malawi. tNyasa Ltd Data Science Lab drew on three main news outlets: the Nation Online newspaper, Radio Maria and the Malawi Broadcasting Corporation. The articles in the dataset are full articles and span many different genres, from social issues, family and relationships to political and economic issues.
The articles were cleaned by removing special characters and HTML tags.
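The exact cleaning rules used for the dataset are not given here. As a rough illustration of this kind of preprocessing, the sketch below strips HTML tags, decodes HTML entities and removes characters other than letters, digits, whitespace and basic punctuation. The function name `clean_text` and the set of characters that are kept are assumptions made for the example. Python's `\w` matches Unicode letters, so Chichewa characters such as `ŵ` are preserved.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
SPECIAL_RE = re.compile(r"[^\w\s.,;:!?'\-]")  # assumed definition of "special characters"
SPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Remove HTML tags and special characters, then collapse whitespace."""
    text = TAG_RE.sub(" ", raw)        # drop tags such as <p> or <br/>
    text = html.unescape(text)         # turn entities like &amp; into characters
    text = SPECIAL_RE.sub(" ", text)   # remove remaining special characters
    return SPACE_RE.sub(" ", text).strip()
```

Because the published articles have already been cleaned, a step like this is mainly useful if you bring in additional raw text of your own.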
Your task is to classify each news article into one of the 20 classes listed below. The classes are mutually exclusive.
List of classes: [‘SOCIAL ISSUES’, ‘EDUCATION’, ‘RELATIONSHIPS’, ‘ECONOMY’, ‘RELIGION’, ‘POLITICS’, ‘LAW/ORDER’, ‘SOCIAL’, ‘HEALTH’, ‘ARTS AND CRAFTS’, ‘FARMING’, ‘CULTURE’, ‘FLOODING’, ‘WITCHCRAFT’, ‘MUSIC’, ‘TRANSPORT’, ‘WILDLIFE/ENVIRONMENT’, ‘LOCALCHIEFS’, ‘SPORTS’, ‘OPINION/ESSAY’]
Files available for download:
- Train.csv – contains the target. This is the dataset that you will use to train your model.
- Test.csv – resembles Train.csv but without the target-related columns. This is the dataset on which you will apply your model.
- SampleSubmission.csv – shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv. The order of the rows does not matter, but the names of the IDs must be correct.
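As a rough guide to producing a submission file, the sketch below reads the IDs from Test.csv, checks that a prediction exists for every ID and for no others, and writes one row per ID. Only the `ID` column name comes from the description above. The `Label` column name, the function names and the single-label-column layout are assumptions, so check SampleSubmission.csv for the exact header before submitting.

```python
import csv


def read_ids(test_path, id_column="ID"):
    """Return the IDs found in the test file, in file order."""
    with open(test_path, newline="", encoding="utf-8") as f:
        return [row[id_column] for row in csv.DictReader(f)]


def write_submission(test_path, predictions, out_path,
                     id_column="ID", label_column="Label"):
    """Write predictions (a dict of ID -> class name) as a submission CSV.

    label_column is an assumed header name; confirm it against
    SampleSubmission.csv.
    """
    ids = read_ids(test_path, id_column)
    missing = [i for i in ids if i not in predictions]
    extra = sorted(set(predictions) - set(ids))
    if missing or extra:
        raise ValueError(
            f"{len(missing)} test IDs lack a prediction; "
            f"{len(extra)} predicted IDs are not in the test file"
        )
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([id_column, label_column])
        for i in ids:
            writer.writerow([i, predictions[i]])
    return len(ids)  # number of rows written, excluding the header
```

A call such as `write_submission("Test.csv", predictions, "submission.csv")` would then produce the file to upload, where `predictions` is whatever mapping from ID to class name your model yields.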