Description

The goal of the challenge is to identify the different Machine Learning (ML) methods proposed so far for structured data, to assess the potential of these methods for dealing with generic ML tasks in the structured domain, to identify the new challenges of this emerging field and to foster research in this domain. Structured data appears in many different domains. We focus here on graph document collections, and we are organizing this challenge in cooperation with the INEX initiative. This challenge aims at gathering ML, Information Retrieval (IR) and Data Mining researchers in order to:

Define the new challenges for structured data mining with ML techniques.
Build interlinked document collections, define evaluation methodologies and develop software to evaluate the classification of documents in a graph.
Compare existing methods on different datasets.

Results of the track will be presented at the INEX workshop.

Task: Graph (Semi-)Supervised Classification

Dealing with XML document collections is a particularly challenging task for ML and IR. XML documents are defined by their logical structure and their content (hence the name semi-structured data). Moreover, in a large majority of cases (Web collections for example), XML document collections are also structured by links between documents (hyperlinks for example). These links can be of different types and correspond to different information: for example, one collection can provide hierarchical links, hyperlinks, citations, etc.

Earlier models developed in the field of XML categorization/clustering simultaneously use the content information and the internal structure of XML documents, but they rarely use the external structure of the collection, i.e. the links between documents.

We focus here on the problem of classification of XML documents organized in a graph. More precisely, the participants of the task have to classify the documents of a partially labelled graph.

Tasks

Collection

The corpus used this year will be a subset of the Wikipedia XML Corpus of INEX 2009. This subset will be different from the one used last year. Mainly:

Each document will belong to one or more categories
Each document will be an XML document
The documents will be organized in a graph where each link corresponds to a hyperlink (or wiki link) between two documents

The corpus proposed is a graph of XML documents.

Semi supervised classification

In this track, the goal is to classify each node of a graph (a node corresponds to a document) knowing a set of already labelled nodes (the training documents). From an ML point of view, the track proposed here is a transductive (or semi-supervised) classification task.

The following figure gives an example of classification task.

Training set: The training set is composed of XML documents organized in a graph. The red nodes correspond to documents in category 1, the blue nodes correspond to documents in category 2. The white nodes correspond to documents whose category is hidden. The goal of the categorization task is to find the categories of the white nodes.

The goal of the categorization models is to find the color of the unlabelled nodes of the training graph.
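As an illustration of the transductive setting described above, here is a minimal sketch of a neighbour-averaging baseline in Python. It is not a method prescribed by the track: the function name `propagate_labels`, the undirected treatment of links, and the toy document ids are all assumptions made for the example, and the document content is ignored entirely.

```python
from collections import defaultdict

def propagate_labels(edges, seed_labels, categories, n_iter=20):
    """Score every node for every category by iterated neighbour averaging.

    edges: iterable of (doc_a, doc_b) links, treated here as undirected.
    seed_labels: dict mapping a labelled doc id to its set of categories.
    Returns a dict: doc id -> {category: score in [0, 1]}.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = set(neighbours) | set(seed_labels)

    # Labelled nodes start at 1.0 for their categories, everything else at 0.0.
    scores = {
        n: {c: 1.0 if c in seed_labels.get(n, ()) else 0.0 for c in categories}
        for n in nodes
    }
    for _ in range(n_iter):
        updated = {}
        for n in nodes:
            if n in seed_labels or not neighbours[n]:
                updated[n] = scores[n]  # labelled or isolated nodes stay fixed
            else:
                updated[n] = {
                    c: sum(scores[m][c] for m in neighbours[n]) / len(neighbours[n])
                    for c in categories
                }
        scores = updated
    return scores

# Toy graph: "x" links to two red documents and one blue one; "y" links only to "b".
edges = [("a", "x"), ("a2", "x"), ("b", "x"), ("b", "y")]
seeds = {"a": {"red"}, "a2": {"red"}, "b": {"blue"}}
scores = propagate_labels(edges, seeds, ["red", "blue"])
```

In this toy graph, the unlabelled node "x" ends up with a higher score for "red" than for "blue", while "y" inherits "blue" from its only neighbour. Because scores are kept per category, the same sketch accommodates documents that belong to more than one category.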

The evaluation measures for categorization will be ROC curves and the F1 measure.
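For readers unfamiliar with the ROC measure, the area under the ROC curve for a single category can be sketched as the fraction of (positive, negative) document pairs that the scores rank correctly. This is the standard definition, not necessarily the exact procedure used by the official evaluation; the function name `roc_auc` and the toy scores are assumptions for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for one category, by pairwise comparison.

    scores: one score per document (higher means more likely in the category).
    labels: one truthy/falsy value per document (in the category or not).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Positives scored 0.9 and 0.3, negatives scored 0.8 and 0.1:
# three of the four pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # 0.75
```

The pairwise form is quadratic in the number of documents, so it is meant to show the definition rather than to scale to a full collection.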

Results by team

The measures computed are:

Measures computed over the categories (micro and macro):
- ACC = Accuracy
- ROC = Area under ROC curve
- PRF = F1 measure
Measure computed over the documents:
- Mean average precision by document

University of Wollongong
University of Peking
Xerox Research Center
University of Saint Etienne
University of Granada
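The difference between the micro and macro averages mentioned in the list of measures can be sketched for the F1 measure as follows. This uses the usual definitions (macro: average of per-category F1; micro: F1 over counts pooled across categories) and is only an assumption about how the official package averages; the names `f1` and `micro_macro_f1` and the toy documents are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(truth, predicted, categories):
    """truth, predicted: dict mapping doc id -> set of categories."""
    counts = {c: [0, 0, 0] for c in categories}  # [tp, fp, fn] per category
    for doc, gold in truth.items():
        guess = predicted.get(doc, set())
        for c in categories:
            if c in guess and c in gold:
                counts[c][0] += 1
            elif c in guess:
                counts[c][1] += 1
            elif c in gold:
                counts[c][2] += 1
    # Macro: every category weighs the same, however rare it is.
    macro = sum(f1(*counts[c]) for c in categories) / len(categories)
    # Micro: pool the counts first, so frequent categories dominate.
    pooled = [sum(counts[c][i] for c in categories) for i in range(3)]
    micro = f1(*pooled)
    return micro, macro

truth = {"d1": {"A"}, "d2": {"A", "B"}, "d3": {"B"}}
predicted = {"d1": {"A"}, "d2": {"A"}, "d3": {"A"}}
micro, macro = micro_macro_f1(truth, predicted, ["A", "B"])
```

Here category A has F1 = 0.8 and category B has F1 = 0, so the macro average is 0.4, while pooling the counts gives a micro average of 4/7. A system that ignores rare categories is therefore penalized more by the macro average.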

Package for computing performances

To use the package, run:

 perl compute.pl all_categories.txt train_categories.txt yourSubmissionFile

If the software finds negative scores in the file, it normalizes the scores by applying a logistic function to them.
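The normalization step can be sketched in Python as below. This is not the code of the package: it assumes the standard logistic function 1 / (1 + exp(-x)) with no scaling, and it assumes that all scores in the file are transformed as soon as one of them is negative. The function names are hypothetical.

```python
import math

def logistic(x):
    """Standard logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def normalize_scores(scores):
    """Map scores into [0, 1] with a logistic function if any score is negative.

    Scores that are all non-negative are returned unchanged.
    """
    if all(s >= 0 for s in scores):
        return list(scores)
    return [logistic(s) for s in scores]

unchanged = normalize_scores([0.2, 0.9])
squashed = normalize_scores([-2.0, 0.0, 3.0])
```

Because the logistic function is strictly increasing, the ranking of the documents is preserved, so rank-based measures such as the area under the ROC curve are unaffected by this step. A score of 0 maps to 0.5.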
