Second Pascal Challenge on
Large Scale Hierarchical Text Classification

Web site: http://lshtc.iit.demokritos.gr/
Email: lshtc_info(at)iit.demokritos.gr

Following a successful first edition, we are pleased to announce the 2nd
edition of the Large Scale Hierarchical Text Classification (LSHTC) Pascal
Challenge. The LSHTC Challenge is a hierarchical text classification
competition, using large datasets. This year’s challenge will increase the
scale and the difficulty of the task, using data from Wikipedia
(www.wikipedia.org), in addition to the ODP Web directory data
(www.dmoz.org).

Hierarchies are becoming ever more popular for the organization of text
documents, particularly on the Web. Web directories and Wikipedia are two
examples of such hierarchies. Along with their widespread use comes the
need for automated classification of new documents into the categories of the
hierarchy. As the size of the hierarchy grows and the number of documents to
be classified increases, a number of interesting machine learning problems
arise. In particular, this is one of the rare situations where data sparsity
remains an issue, despite the vastness of available data: as more documents
become available, more classes are also added to the hierarchy, and there is
a very high imbalance between the classes at different levels of the
hierarchy. Additionally, the statistical dependence of the classes poses
challenges and opportunities for the learning methods.

The challenge consists of three categorization tasks, involving different
documents and category systems. In particular, the largest category system,
based on Wikipedia, contains more than 300,000 categories and 2M documents
for training. The largest category system ever used in the past for
evaluation purposes, to the best of our knowledge, was based on the Yahoo!
Directory and contained 130,000 categories and 500,000 training documents.
In addition to the largest task, two smaller ones, based on Wikipedia and
DMOZ respectively, are included in the challenge. Their scale is on the
order of that of the first edition of the challenge. All of the datasets in
this edition are multi-label. In the two Wikipedia-based datasets, each
document is assigned on average to 3.2 and 4.6 categories, respectively.
Furthermore, the hierarchies are no longer simple tree
structures, as both documents and subcategories are allowed to belong to
more than one other category. More information regarding the tasks and the
challenge rules can be found at the challenge’s Web site; follow the “Tasks,
Rules and Guidelines” link.
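
To make the non-tree structure concrete, the following is a minimal sketch of
a hierarchy in which a category may have several parents, and of how the
labels of a multi-label document could be expanded to all of their ancestors.
The category names, the edge list and the helper functions are hypothetical
illustrations and do not reflect the actual format of the challenge data.

```python
from collections import defaultdict


def build_parents(edges):
    """Map each category to the set of its direct parents."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return parents


def ancestors(category, parents):
    """Collect every category reachable by following parent links upward."""
    seen = set()
    stack = [category]
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


# Hypothetical toy hierarchy: "Machine learning" has two parents,
# so the structure is a directed acyclic graph rather than a tree.
edges = [
    ("Science", "Computer science"),
    ("Science", "Statistics"),
    ("Computer science", "Machine learning"),
    ("Statistics", "Machine learning"),
    ("Machine learning", "Text classification"),
]
parents = build_parents(edges)

# A multi-label document assigned to two categories (assumed labels).
doc_labels = {"Text classification", "Statistics"}

# Expand the labels with all ancestors reachable through any parent.
expanded = set(doc_labels)
for label in doc_labels:
    expanded |= ancestors(label, parents)
```

The visited set in ancestors() matters here: with multiple parents, the same
ancestor can be reached along several paths and should be counted only once.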

As in the first edition, participants will be able to submit runs
continuously, in order to improve their systems. This year we also plan a
two-stage evaluation of the participating methods: one stage measuring
classification performance and one measuring computational performance. It
is important to measure both, as the two are interdependent. The results
will be included in a final report about the challenge, and we also aim to
organize a special ECML’11 workshop.

To register for the challenge and gain access to the datasets, you must have
an account at the challenge Web site.

Key dates:

Start of testing: January 15, 2011
End of testing: March 31, 2011
Submission of executables and short papers to challenge organizers: April
30, 2011
Submission of workshop papers: May 31, 2011
ECML’11 workshop (subject to approval): September 5, 2011

Organisers:

George Paliouras, NCSR “Demokritos”, Athens, Greece
Eric Gaussier, LIG, Grenoble, France
Aris Kosmopoulos, NCSR “Demokritos” & AUEB, Athens, Greece
Ion Androutsopoulos, AUEB, Athens, Greece
Thierry Artières, LIP6, Paris, France
Patrick Gallinari, LIP6, Paris, France