GRavitational lEnsing Accuracy Testing 2010 (GREAT10) is a public image analysis challenge aimed at the development of algorithms to analyze astronomical images. Specifically, the challenge is to measure varying image distortions in the presence of a variable convolution kernel, pixelization and noise. This is the second in a series of challenges set to the astronomy, computer science and statistics communities, providing a structured environment in which methods can be improved and tested in preparation for planned astronomical surveys. GREAT10 extends upon previous work by introducing variable fields into the challenge. The “Galaxy Challenge” involves the precise measurement of galaxy shape distortions, quantified locally by two parameters called shear, in the presence of a known convolution kernel. Crucially, the convolution kernel and the simulated gravitational lensing shape distortion both now vary as a function of position within the images, as is the case for real data. In addition, we introduce the “Star Challenge” that concerns the reconstruction of a variable convolution kernel, similar to that in a typical astronomical observation. This document details the GREAT10 Challenge for potential participants. Continually updated information is also available from the challenge website.

In 2006 the PASCAL network funded the 1st Speech Separation challenge which addressed the problem of separating and recognising speech mixed with speech. We are now launching a successor to this challenge that aims to tackle speech separation and recognition in more typical everyday listening conditions.

The challenge employs a noise background collected from a real family living room using binaural microphones. Target speech commands have been mixed into this environment at a fixed position using genuine room impulse responses. The task is to separate the speech and recognise the commands being spoken, using systems that have been trained on noise-free commands and room-noise recordings.

On this web site you will find everything you need to get started. The background section explains the general motivation for the challenge. The instructions section describes the separation/recognition task in more detail and what you need to do in order to take part. The datasets are available for download already, and the evaluation tools will become available by the end of October. Further important dates are listed in the schedule.

This is a multidisciplinary challenge. We hope to encourage participants from the machine learning, source separation and speech recognition communities. Although the ultimate evaluation will be through speech recognition scores, participants may submit separated signals, robust feature extractors or complete recognition systems. All entrants will be invited to submit papers describing their work to a dedicated satellite workshop hosted at Interspeech 2011. A Special Issue of Computer Speech and Language on the theme of Machine Listening in Multisource Environments has also been organised, and participants will be invited to submit longer versions of their work to it.

General questions [and terms]

Why is the shared task called Zulu?

Learning languages by active learning is an important topic in computational linguistics because it is not always possible to rely on statistics. This is particularly the case for languages for which large corpora are not available, a setting sometimes described as working on under-resourced languages.

In a couple of sentences, what is it about?

Zulu is an active learning task. The participants are asked to build algorithms that can learn languages by querying a server via membership queries. The targets are regular languages, and for each task a fixed number of queries is allowed.

Can anyone participate?

Yes. Anyone can. One should remember that the winner of the 1998 Abbadingo competition (Nick Price) was a computer scientist not affiliated with any research lab! Typically, students who have learnt about Dana Angluin’s L* algorithm should be interested. But we hope that researchers in computational linguistics, machine learning or any field which needs grammatical inference will be interested.

What is a task? A task owner?

A task is set when a new target automaton is created. During the training phase anyone can create a new task, of which they become the task owner. The task owner is the first to try to solve a given task. Their score appears on the web and can be used as a guideline by the other players.

What is the baseline algorithm?

The baseline algorithm is a simple adaptation of Angluin’s 1987 L* algorithm. It makes membership queries in order to obtain an observation table which is both closed and consistent. At this point it samples some strings (using a distribution based on the size of the current hypothesis) and tests, again via membership queries, the labelling of the sampled strings against the labelling returned by the oracle. If all strings are correctly labelled, the algorithm decides that (under the important assumption that the chosen distribution was good enough) the error cannot be too large, and it halts. If some string appears to be badly labelled, the algorithm can use it as a counterexample, as if it had been returned by an equivalence query.
More details concerning the baseline algorithm can be found here.
The baseline algorithm has two purposes. It is used to compute the number of queries allowed for the task, but it can also be used by any user as a starting point to participate in the competition.
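The sampling step that stands in for an equivalence query can be sketched as follows. This is an illustrative sketch, not the actual baseline code: the names (`hypothesis_accepts`, `membership_query`) and the particular length distribution are assumptions.

```python
import random

def sampled_equivalence_check(hypothesis_accepts, membership_query,
                              hyp_size, alphabet=("a", "b"),
                              n_samples=200, seed=0):
    """Approximate an equivalence query by sampling: draw strings from a
    length distribution based on the size of the current hypothesis, and
    compare the hypothesis's labels with the oracle's. Returns None if all
    samples agree, otherwise a counterexample string."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        length = rng.randint(0, 2 * hyp_size)  # larger hypotheses, longer tests
        w = "".join(rng.choice(alphabet) for _ in range(length))
        if hypothesis_accepts(w) != membership_query(w):
            return w  # treat as a counterexample from an equivalence query
    return None       # no disagreement found: the error is probably small
```

Note that each comparison costs one membership query, so `n_samples` counts against the query budget allocated to the task.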

Technical questions

Do I have to use Java?

No. The baseline is written in Java, but since the programme will run on your own machine, you can choose anything. Please refer to the Develop your algorithm in your language section.

Why do I not see my score with the score of the task owner?

Because this is only the training phase. People might get things wrong, and would cease to participate if their first trials were consistently beaten by everyone else. Furthermore, the second time one plays a task one cannot avoid various forms of hill climbing: one could rely on the results of queries made by another algorithm, or one already knows the exact size of the target.
But you are encouraged to discuss with the task-owner if you want to know more.

Then why play a task that I don’t own?

Because it can be interesting to see how well one’s own algorithm is progressing.

How are the DFA generated?

We are using the same programs that were used by Gowachin. Details can be found on its webpage and in the associated publications [lang1998rao]. To keep things simple, we start by randomly drawing a number n at most as large as the parameter proposed by the task owner. We then construct a random graph with slightly more states than n and randomly decide which states are final. Then minimisation takes place.
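A minimal sketch of this generation procedure follows; the growth factor, the final-state probability and the parameter names are illustrative assumptions, and the final minimisation step is omitted.

```python
import random

def random_dfa(n_max, alphabet=("a", "b"), seed=0):
    """Draw n at most n_max, build a random transition graph on slightly
    more states than n, mark random states as final, and keep the part
    reachable from state 0 (minimisation would follow in the real setup)."""
    rng = random.Random(seed)
    n = rng.randint(1, n_max)
    m = int(n * 1.25) + 1                      # "slightly more states than n"
    delta = {(q, s): rng.randrange(m) for q in range(m) for s in alphabet}
    finals = {q for q in range(m) if rng.random() < 0.5}
    reach, todo = {0}, [0]                     # trim to states reachable from 0
    while todo:
        q = todo.pop()
        for s in alphabet:
            r = delta[(q, s)]
            if r not in reach:
                reach.add(r)
                todo.append(r)
    trans = {q: {s: delta[(q, s)] for s in alphabet} for q in reach}
    return trans, finals & reach
```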

How are the data from the test set generated?

Again we follow the protocol from Abbadingo and Gowachin. A maximum length k is defined that depends on the depth of the DFA. Then strings are randomly drawn from \Sigma^{\le k}.
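Assuming the draw is uniform over \Sigma^{\le k}, the sampling can be sketched as follows: the length l is chosen with weight |\Sigma|^l, since there are exactly |\Sigma|^l strings of length l, and then each symbol is picked uniformly.

```python
import random

def sample_uniform_upto_k(alphabet, k, rng):
    """Draw one string uniformly from Sigma^{<=k}: pick the length with
    weight |Sigma|**l, then pick each symbol uniformly."""
    sizes = [len(alphabet) ** l for l in range(k + 1)]
    length = rng.choices(range(k + 1), weights=sizes)[0]
    return "".join(rng.choice(alphabet) for _ in range(length))

rng = random.Random(42)
sample = sample_uniform_upto_k("ab", 5, rng)
```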

Why am I given 1723 queries?

This number is computed as follows: when a target is created, the baseline algorithm is run; when this algorithm reaches an error rate below 40%, the number of queries it has made up to that point is allocated to the task.

Why am I not given the real number of states of the target?

Two reasons: the first is that this is not about identification; the second is that since L* makes use of equivalence queries, and we do not allow these, the learner has to do some sampling instead. But since it does not know the real distribution, this is a problem (and it is easy to show that, in theory, there is no good answer).

About the competition itself

What will the actual competition be like?

During the competition phase, the organisers will be the task owners. They will provide a series of tasks.

But I can see all sorts of loopholes and ways to collude. Are you not scared of cheating?

No. Obviously, before proclaiming the winner, the organisers will ask for the code and the sources. They will then run the winner’s code on their own machines.

But why not do this for all players?

Because that would oblige us to impose some very strict format. The idea here is not to impose limitations on how the algorithms treat the information they gather, but only on how much information they gather.

What do we win?

The actual competition will probably offer a small financial prize for the winner. Furthermore, with the help of our scientific committee, we should be able to propose that the winner publish their solution in the Machine Learning Journal.
We also hope to hold a workshop in the autumn of 2010.

About the theory, the DFA, active learning

What is the relevant theory about all this?

There are a number of papers one might read: Angluin’s papers on query learning, her description of L*, and the relevant chapter in Kearns and Vazirani’s book (see the references below).

But… is it useful to learn DFA from membership queries ?

This may come as a surprise, but it is. Here are some examples.

  • Siemens
  • Squirrel

Odds and ends

I don’t find my question here. How can I ask it?

Simple: send us an e-mail. If the question (and the answer) are likely to be of interest to others, we will add it to this list.

I don’t think this is the right competition. Why are we not learning X from Y and counting Z? (Here X can be any type of language over all sorts of alphabet sizes, Y can be any sort of queries or information, and Z any measure of the quantity of resources needed for learning.)

Indeed, there is room for improvement. Our feeling was that, although active learning is an important question, we do not know enough about learning strategies in which the learner can interactively interrogate its environment. In this setting, the crucial resource is not the (CPU) time that the learning program spends, but the number of times it uses the expensive resource, the oracle.
In all cases, since we have noted that this question is an important one, we will gladly discuss this in the forum.

What is best? To score 60% with 1000 queries or 70% with 2000?

Another good question. In preliminary experiments we noted an (albeit uneven) exponential growth of the number of queries with the rate of success. For the moment, we fix the number of queries. If you can do better with fewer, good: it probably means that on another task you will score higher.

Lang, K.J., Pearlmutter, B.A. and Price, R.A. (1998). Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. Lecture Notes in Computer Science.

Some applications of active learning in the case of finite automata

  • The earliest task was that of map building in robotics: the (simplified) map is a graph, and the outputs in each state are what the robot may encounter there. A particularity here is that the robot cannot be reset: the learning algorithm has to learn from just one very long string and all its prefixes.


R. L. Rivest and R. E. Schapire. Inference of finite automata using homing sequences. Information and Computation, 103:299-347, 1993.

  Thomas Dean, Dana Angluin, Kenneth Basye, Sean P. Engelson, Leslie Pack Kaelbling, Evangelos Kokkevis, Oded Maron: Inferring Finite Automata with Stochastic Output Functions and an Application to Map Learning. AAAI 1992: 208-214.

  • A somewhat related task is that of discovering the rational strategy followed by an adversary. This line was explored in a number of papers related to agent technologies.


D. Carmel and S. Markovitch. Model-based learning of interaction strategies in multi-agent systems. Journal of Experimental and Theoretical Artificial Intelligence , 10(3):309-332, 1998.

D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-agent Systems , 2(2):141-172, 1999.

  • One task on which grammatical inference is proving to be particularly useful is wrapper induction: the idea is to find, in a web page (or several pages of the same type), all arguments of a particular sort. In this context, the role of the Oracle is played by the human user.


J. Carme, R. Gilleron, A. Lemay, and J. Niehren. Interactive learning of node selecting tree transducer. In Ijcai Workshop on Grammatical Inference , 2005.

Julien Carme , Rémi Gilleron , Aurélien Lemay , Joachim Niehren: Interactive learning of node selecting tree transducer. Machine Learning 66 (1): 33-67 (2007)

  • In various tasks linked with checking whether the specifications of a system or a piece of hardware are met, the item to be checked is used as an oracle: queries are made in order to build a model of the item, and the learned model can then be checked against the specifications. Obviously, there is an issue with the distribution used to simulate the equivalence query in this context, where the actual chips serve as the Oracle.


L. Bréhélin, O. Gascuel, and G. Caraux. Hidden Markov models with patterns to learn boolean vector sequences and application to the built-in self-test for integrated circuits. Pattern Analysis and Machine Intelligence , 23(9):997-1008, 2001.

T. Berg, O. Grinchtein, B. Jonsson, M. Leucker, H. Raffelt, and B. Steffen. On the correspondence between conformance testing and regular inference. In Proceedings of Fundamental Approaches to Software Engineering, 8th International Conference, FASE 2005, volume 3442 of LNCS, pages 175-189. Springer-Verlag, 2005.

H. Raffelt and B. Steffen. LearnLib: A library for automata learning and experimentation. In Proceedings of FASE 2006, volume 3922 of LNCS, pages 377-380. Springer-Verlag, 2006.

Other links

Dana Angluin: A Note on the Number of Queries Needed to Identify Regular Languages Information and Control 51 (1): 76-87 (1981)

Dana Angluin: Queries and Concept Learning. Machine Learning 2 (4): 319-342 (1987)

Dana Angluin: Learning Regular Sets from Queries and Counterexamples Inf. Comput. 75 (2): 87-106 (1987)

Dana Angluin: Equivalence Queries and Approximate Fingerprints. COLT 1989 : 134-145

Dana Angluin, Michael Kharitonov : When Won’t Membership Queries Help? (Extended Abstract) STOC 1991 : 444-454

Dana Angluin: Queries revisited. Theor. Comput. Sci. 313 (2): 175-194 (2004)

C. de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 38:1332–1348, 2005.

C. de la Higuera. Data complexity issues in grammatical inference. In M. Basu and T. Kam Ho, editors, Data Complexity in Pattern Recognition , pages 153-172. Springer-Verlag, 2006.

C. de la Higuera. Ten open problems in grammatical inference. In Sakakibara et al. [SKS+ 06], pages 32-44.

M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory . MIT  press, 1994.

The UAI Approximate Inference Competition is now over. A summary of the competition is available.

We are pleased to announce an upcoming evaluation of probabilistic approximate inference algorithms, as part of the UAI conference this July. All researchers working on inference in graphical models are encouraged to participate.

The evaluation will focus on the following computational tasks:

  • Calculating MAP assignments
  • Calculating partition functions
  • Calculating marginals

Submitted algorithms can compete in some or all of these tasks.
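For intuition, the three quantities can be computed by brute-force enumeration on a toy problem; the two-variable binary pairwise model and its potentials below are invented purely for illustration and are far smaller than anything in the competition.

```python
from itertools import product

# Hypothetical toy model: two binary variables x0, x1 with unary
# potentials phi and a pairwise potential psi favouring agreement.
phi = [[1.0, 2.0],        # phi[i][v] = unary potential of x_i = v
       [3.0, 1.0]]
psi = [[2.0, 1.0],        # psi[v0][v1], larger when x0 == x1
       [1.0, 2.0]]

def score(x0, x1):
    """Unnormalised score of a full assignment."""
    return phi[0][x0] * phi[1][x1] * psi[x0][x1]

# Partition function: sum of unnormalised scores over all assignments.
Z = sum(score(x0, x1) for x0, x1 in product((0, 1), repeat=2))

# MAP assignment: the single highest-scoring assignment.
map_assignment = max(product((0, 1), repeat=2), key=lambda xs: score(*xs))

# Marginal of x0: sum out x1, then normalise by Z.
marg_x0 = [sum(score(v, x1) for x1 in (0, 1)) / Z for v in (0, 1)]
```

Approximate inference algorithms estimate these same quantities without enumerating the (exponentially many) assignments.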

New features in this year’s competition:

  • A special track will evaluate how well inference algorithms do when used as a black box for parameter learning.
  • There will be separate tracks for approximate algorithms that provide upper or lower bound guarantees and those that do not. Exact algorithms are of course invited to participate in any of these tracks.
  • Participants will receive automated performance reports and will be able to submit improvements during the entire competition period. A summarized leader board will be updated continuously.
  • There will be separate tracks for different running time regimes (i.e., 20 seconds, 20 minutes, two hours).

We look forward to your participation,
The organizers


  • Four free UAI registrations will be awarded to the winning participants

Community Recognition

  • The top participants will have 10 minutes to present their solvers in the main track of the conference


  • May 3 – Dataset and submission format announced.
  • May 10 – Submissions accepted on website.
  • July 1 – Final deadline for submissions.
  • July 4 – Final results and winners announced.
  • July 8-11 – Results reported at the UAI conference.


Organizers

  • Gal Elidan – Hebrew University
  • Amir Globerson – Hebrew University

Program Committee

  • Jeff Bilmes – Univ. of Washington
  • Rina Dechter – UC Irvine
  • Peter Grünwald – CWI
  • Isabelle Guyon – Clopinet
  • Peter Spirtes – CMU

Student Organizer

  • Uri Heinemann – Hebrew University


Given two text fragments called ‘Text’ and ‘Hypothesis’, Textual Entailment Recognition is the task of determining whether the meaning of the Hypothesis is entailed (can be inferred) from the Text. The goal of the first RTE Challenge was to provide the NLP community with a benchmark to test progress in recognizing textual entailment, and to compare the achievements of different groups. Since its inception in 2004, the PASCAL RTE Challenges have promoted research in textual entailment recognition as a generic task that captures major semantic inference needs across many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and multi-document Summarization.

After the first three highly successful PASCAL RTE Challenges, RTE became a track at the 2008 Text Analysis Conference, which brought it together with communities working on NLP applications. The interaction has provided the opportunity to apply RTE systems to specific applications and to move the RTE task towards more realistic application scenarios.

RTE-6 Tasks

The RTE-6 tasks focus on recognizing textual entailment in two application settings: Summarization and Knowledge Base Population.

  1. Main Task (Summarization scenario): Given a corpus and a set of “candidate” sentences retrieved by Lucene from that corpus, RTE systems are required to identify all the sentences from among the candidate sentences that entail a given Hypothesis. The RTE-6 Main Task is based on the TAC Update Summarization Task. In the Update Summarization Task, each topic contains two sets of documents (“A” and “B”), where all the “A” documents chronologically precede all the “B” documents. An RTE-6 Main Task “corpus” consists of 10 “A” documents, while Hypotheses are taken from sentences in the “B” documents.
  2. KBP Validation Pilot (Knowledge Base Population scenario): Based on the TAC Knowledge Base Population (KBP) Slot-Filling task, the new KBP validation pilot task is to determine whether a given relation (Hypothesis) is supported in an associated document (Text). Each slot fill that is proposed by a system for the KBP Slot-Filling task would create one evaluation item for the RTE-KBP Validation Pilot: The Hypothesis would be a simple sentence created from the slot fill, while the Text would be the source document that was cited as supporting the slot fill.

RTE-6 does not include the traditional RTE Main Task which was carried out in the first five RTE challenges; i.e., there will be no task to make entailment judgments over isolated T-H pairs drawn from multiple applications. Instead, the new Main Task for RTE-6 is based on only the Summarization application setting. The RTE-6 Main Task is similar to the RTE-5 Search Pilot, with the following changes:

  • RTE-6 hypotheses will be taken from sentences in the “B” documents, rather than from Summary Content Units created from human-authored summaries of the “A” documents.
  • Rather than searching for entailing sentences from the entire corpus, a Lucene baseline will first be run to retrieve a smaller number of candidate sentences for entailment.
  • The exploratory effort on resource evaluation will continue through ablation tests for the new RTE-6 Main Task.


RTE-6 Schedule
April 30 Main Task: Release of Development Set
May 10 KBP Validation Pilot: Release of Development Set
May 21 Deadline for TAC 2010 track registration
August 17 KBP Validation Pilot: Release of Test Set
August 30 Main Task: Release of Test Set
September 9 Main Task: Deadline for task submissions
September 16 Main Task: Release of individual evaluated results
September 17 KBP Validation Pilot: Deadline for task submissions
September 24 KBP Validation Pilot: Release of individual evaluated results
September 26 Deadline for TAC 2010 workshop presentation proposals
September 30 Main Task: Deadline for ablation tests submissions
October 7 Main Task: Release of individual ablation test results
October 27 Deadline for systems’ reports
November 15-16 TAC 2010 Workshop

Mailing List

The mailing list for the RTE Track is The list is used to discuss and define the task guidelines for the track, as well as for general discussion related to textual entailment and its evaluation. To subscribe, send a message to such that the body consists of the line:
subscribe rte <FirstName> <LastName>
In order for your messages to get posted to the list, you must send them from the email address used when you subscribed to the list. To unsubscribe, send a message from the subscribed email address to such that the body consists of the line:
unsubscribe rte
For additional information on how to use mailing lists hosted at NIST, send a message to such that the body consists of the line:

Organizing Committee

      Luisa Bentivogli, CELCT and FBK, Italy
      Peter Clark, Boeing, USA
      Ido Dagan, Bar Ilan University, Israel
      Hoa Trang Dang, NIST, USA
      Danilo Giampiccolo, CELCT, Italy


The Challenge results are now available as well as the Challenge workshop program and slides.

The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.

The scientific goals are:

  • To learn of the phenomena underlying word construction in natural languages
  • To discover approaches suitable for a wide range of languages
  • To advance machine learning methodology

Morpho Challenge 2010 is a follow-up to our previous Morpho Challenge 2005, 2007, 2008 and 2009. The task in 2010 is similar to 2009, where the aim was to find the morpheme analysis of the word forms in the data. As a new task we will provide a possibility for semi-supervised learning using the available linguistic gold standard morpheme analysis.

Participation in the previous challenges is by no means a prerequisite for participation in Morpho Challenge 2010. Everyone is welcome and we hope to attract many participating teams. The results will be presented in a workshop. Please read the rules and see the schedule. The datasets are available for download. Submit your analyses (result files) by sending them by email to the organizers, or by indicating a location where the organizers can download your files. Remember also to describe your algorithm in a paper. Please read the formatting instructions in rules.

If you plan to participate in Morpho Challenge, please contact the organizers using the email address in contact and ask to be added in our mailing list. We will use this mailing list to provide news about the tasks, data and evaluations.

We are looking forward to an interesting challenge!

Mikko Kurimo, Krista Lagus, Sami Virpioja and Ville Turunen
Adaptive Informatics Research Centre, Aalto University School of Science and Technology (previously known as Helsinki University of Technology)
The organizers


Mathias Creutz and Krista Lagus (2005). Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0. Publications in Computer and Information Science, Report A81, Helsinki University of Technology, March.
[ Article (PDF) ]

Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Janne Pylkkönen, and Sami Virpioja (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech and Language, Volume 20, Issue 4, October, pages 515–541.
[ Article (PDF) ]

Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi (2007). Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Proceedings of the Machine Translation Summit XI, pages 491–498.
[ Article (PDF) ]

Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. Overview and results of Morpho Challenge 2009. In Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 2009.
[ Article (PDF) ]


Submission of large data files

Send an email to the organizers morphochallenge2010<> and tell them where they can download the data files. Small data files (no larger than a few MB) can be emailed directly. Please follow carefully the format of the result files described in datasets.


The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.


Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which of the three will be accepted.

Test languages

Data sets are provided for four languages: English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.

(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)


The task is to develop a system that conducts unsupervised morpheme analysis for every word form contained in a word list supplied by the organizers for each test language.

The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.

New in 2010: A new category for semi-supervised learning algorithms using the available linguistic gold standard morpheme analysis. The abstracts submitted by the participants must contain clear descriptions of which steps of supervision or parameter optimization are involved in the algorithms.

Participants are also allowed to use additional text corpora from any source for training their algorithms.


The segmentations will be evaluated in three complementary ways:

  • Competition 1: The proposed morpheme analyses will be compared to a linguistic “gold standard”.
  • Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.
  • Competition 3: A Machine Translation (MT) model is trained in which the words in the source-language documents are replaced by their proposed morpheme representations. The words in the source-language evaluation data are then also replaced by their proposed morpheme representations, and the translation is based on morphemes instead of words.

Competition 1 will include all four test languages. Winners will be selected separately for each language and category (unsupervised or semi-supervised). As a performance measure, the F-measure of accuracy of suggested morpheme analyses is utilized. Should two solutions produce the same F-measure, the one with higher precision will win.

Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.

Competition 3 will include two of the test languages. Translation will be done from the test language to English. The organizers will train the translation models and perform the evaluation of the translations using an automatic metric such as BLEU.

Workshop and publication

All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the organizers based on the results and a 4-page extended abstract describing the algorithm and the data submission. However, all groups who have submitted results and a paper are welcome to participate in the workshop to listen to the talks and join the discussions.

Extended abstracts

For the extended abstract you can use the two-column ACL format. For templates and general formatting instructions, see ACL 2010 Instructions for Preparing Camera-Ready Papers. The length of the paper should be around 4 pages. Email your extended abstract to the organizers by July 16.


In the case of disagreement the organizers will decide the final interpretation of the rules.

Competition 1

In Competition 1, for each language, the morpheme analyses proposed by the participants’ algorithm will be compared against a linguistic gold standard. Since the task at hand involves unsupervised learning, it cannot be expected that the algorithm comes up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between labels as such (the labels in the proposed analyses vs. the labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants’ algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words “foot” and “feet” both contain the morpheme “foot_N”. It is thus desirable that the participants’ algorithm also discovers a morpheme that occurs in both these word forms (be it called “FOOT”, “morpheme784”, “foot” or something else).

In practice, the evaluation will take place by sampling a large number of word pairs, such that both words in the pair have at least one morpheme in common. As the evaluation measure, we will use F-measure, which is the harmonic mean of Precision and Recall:

F-measure = 2/(1/Precision + 1/Recall) = 2 × Precision × Recall / (Precision + Recall).
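As a quick numeric check of the harmonic mean:

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall.
    return 2.0 / (1.0 / precision + 1.0 / recall)

f_measure(0.5, 0.5)   # -> 0.5: equal precision and recall give that value back
```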

Precision is here calculated as follows: a number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen from the result file at random (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really does have a morpheme in common according to the gold standard. The maximum number of points for one sampled word is normalized to one. The total number of points is then divided by the total number of sampled words.

For instance, assume that the proposed analysis of the English word “abyss” is: “abys +s”. Two word pairs are formed: say that “abyss” happens to share the morpheme “abys” with the word “abysses”; we thus obtain the word pair “abyss – abysses”. Also assume that “abyss” shares the morpheme “+s” with the word “mountains”; this produces the pair “abyss – mountains”. Now, according to the gold standard, the correct analyses of these words are “abyss_N”, “abyss_N +PL” and “mountain_N +PL”, respectively. The pair “abyss – abysses” is correct (common morpheme: “abyss_N”), but the pair “abyss – mountains” is incorrect (no morpheme in common). Precision for the word “abyss” is thus 1/2 = 50%.
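The worked example can be replayed mechanically. The dictionary below encodes only the gold-standard analyses quoted above, as sets of morpheme labels; this flat representation is an illustrative simplification of the actual gold-standard files.

```python
# Gold-standard analyses from the example (sets of morpheme labels).
gold = {
    "abyss":     {"abyss_N"},
    "abysses":   {"abyss_N", "+PL"},
    "mountains": {"mountain_N", "+PL"},
}

# Word pairs formed from the proposed analysis "abys +s" of "abyss":
# one pair per proposed morpheme, each pair sharing that morpheme.
pairs = [("abyss", "abysses"), ("abyss", "mountains")]

# A pair scores a point if the two words share a morpheme in the gold standard.
points = sum(1 for a, b in pairs if gold[a] & gold[b])
precision_abyss = points / len(pairs)   # 1 of 2 pairs scores: 0.5
```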

Recall is calculated analogously to precision: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen from the gold standard at random (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that has a morpheme in common also in the analyses proposed by the participants’ algorithm. The points per word are normalized to one, and the total number of points is divided by the total number of sampled words.

For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, normalization of the points is carried out. In short, an equal weight is given for each alternative analysis, as well as each word pair in an analysis. E.g., if a word has three alternative analyses, the first analysis has four morphemes, and the first word pair in that analysis has two morphemes in common, each of the two common morphemes will amount to 1/3*1/4*1/2=1/24 of the one point available for that word.
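The weighting in the example above can be made concrete with a small helper (illustrative only; the function name is invented):

```python
def morpheme_weight(n_alternatives, n_pairs_in_analysis, n_shared):
    """Weight of one shared morpheme in one word pair: equal weight per
    alternative analysis, per word pair within an analysis, and per
    shared morpheme within a pair."""
    return (1.0 / n_alternatives) * (1.0 / n_pairs_in_analysis) * (1.0 / n_shared)

# Three alternative analyses, four morphemes (= four pairs) in the first
# analysis, two morphemes shared in the first pair:
print(morpheme_weight(3, 4, 2))  # 1/24, roughly 0.0417
```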

Small sets of the gold standards used are available on the datasets page for semi-supervised learning and parameter validation. Neither the labeled training set nor the development test set contain any of the word forms used in the final test set.

Evaluation of a sample (development test set)

You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is a Perl script, invoked as follows: [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile

Four files are given as arguments to the evaluation script:

  1. wordpairsfile_goldstd: this is the “random word pairs file” available for download on the datasets page. This file is needed in the calculation of an estimate of the recall of the proposed morpheme analyses.
  2. wordpairsfile_result: this file has to be generated using another program (see below). It is needed in the calculation of a rough estimate of the precision of the proposed morpheme analyses.
  3. goldstdfile: this is the sample of the gold standard available for download on the datasets page. This file contains the correct morpheme analyses.
  4. resultfile: this is the result file that your algorithm produces, i.e., a list of words and their proposed morpheme analyses. Make sure that the file format is correct and does not include any additional whitespace.

The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:

PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76)
PART0. Recall:    25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345)
PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73%
TOTAL. Recall:    25.59%; non-affixes: 49.78%; affixes: 10.78%
TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%

Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants’ algorithms.

The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., “+PL”, “+PAST”. This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.

Sampling word pairs for the calculation of an estimate of the precision

In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the gold standard labels that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the gold standard sample. There is still a problem that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.

Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:

cut -f1 goldstdfile > relevantwordsfile

Then sample word pairs for 300 words selected at random from your results file: -n 300 -refwords relevantwordsfile < resultfile > wordpairsfile_result

You can use more than 300 words if you wish (the maximum is the amount of words in the gold standard file), or sample several word pair files with different random initializations (-rand switch), but remember that the estimate of precision that you will obtain is biased in any case, due to the limited amount of gold standard labels.
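For reference, the sampling step can be approximated in Python (a sketch assuming the result file is tab-separated, one word and its space-separated morpheme analysis per line; the official tool is the Perl program mentioned in the text):

```python
import random

def sample_word_pairs(resultfile, relevantwordsfile, n=300, seed=0):
    """Sample word pairs sharing a morpheme, restricted to words that
    also occur in the gold standard sample (the 'relevant' words)."""
    rng = random.Random(seed)  # vary the seed to mimic the -rand switch
    relevant = set(line.strip() for line in open(relevantwordsfile))
    analyses = {}
    for line in open(resultfile):
        word, analysis = line.rstrip("\n").split("\t", 1)
        if word in relevant:
            analyses[word] = analysis.split()
    words = rng.sample(sorted(analyses), min(n, len(analyses)))
    pairs = []
    for word in words:
        for m in analyses[word]:
            partners = [w for w in analyses if w != word and m in analyses[w]]
            if partners:
                pairs.append((word, rng.choice(partners), m))
    return pairs
```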

The sampling is carried out by a separate Perl program; its output file wordpairsfile_result is used as input to the evaluation script (see above).

Competition 2

Competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants in information retrieval experiments. Data from CLEF will be used.
However, participants who wish to submit morpheme analyses for words in their actual context (competition 2b) should contact the organizers for more information on how to register with CLEF to obtain the full texts.

In competition 2 (and 2b) the words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the IR evaluation using the state-of-the-art Okapi (BM25) retrieval method (the latest version of the freely available LEMUR toolkit). The most common morphemes in each participant’s submission will be left out of the index. The size of this stoplist will be proportional to the amount of text data in each language, and the stoplist size will be the same for each participant’s submission. The evaluation criterion is Uninterpolated Average Precision; the segmentation with the highest Average Precision wins. The winner is selected separately for competitions 2 and 2b in each language.
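For background, BM25 ranks a document by summing, over query terms, an inverse-document-frequency weight times a saturating term-frequency factor. A generic sketch with common default constants (k1 = 1.2, b = 0.75 are illustrative defaults, not LEMUR's actual challenge settings; documents here are token lists of morphemes):

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one document (a token list) for a query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc.count(t)  # term frequency in this document
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

The term-frequency factor saturates as tf grows, and the length normalization (controlled by b) discounts matches in long documents; this is why trimming a shared stoplist of very common morphemes mainly affects the idf-dominated tail.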

Competition 3

In competition 3, the morpheme analyses proposed by the participants’ algorithm will be evaluated in a statistical machine translation (SMT) framework. The translation models will be trained to translate from a morphologically complex source language to English. The words of the source language will be replaced by their morpheme analyses before training. The translations from this morpheme-to-word model will be combined with translations from a standard word-to-word translation model. For all models, we will use a state-of-the-art phrase-based SMT system. Evaluation of the translations will be performed by applying an automatic metric such as BLEU on a held-out test set.
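For reference, BLEU combines modified n-gram precisions (usually up to 4-grams) with a brevity penalty. A simplified single-reference sketch (real evaluations use a full implementation; the tiny epsilon here only avoids log(0)):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty (single reference, minimal smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1.0 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```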

Competition 3 does not require any extra effort by the participants. The organizers will use the analyses provided by the participants in machine translation experiments. Data from the Europarl corpus will be used. Participants who wish to submit morpheme analyses for words in their actual context should contact the organizers for information on how to get the full corpus.


The goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:

  • Person: person
  • Animal: bird, cat, cow, dog, horse, sheep
  • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
  • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

There will be three main competitions: classification, detection, and segmentation; and three “taster” competitions: person layout, action classification, and ImageNet large scale recognition:

Classification/Detection Competitions

  1. Classification: For each of the twenty classes, predicting presence/absence of an example of that class in the test image.
  2. Detection: Predicting the bounding box and label of each object from the twenty target classes in the test image.
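In the VOC evaluation protocol, a predicted bounding box is conventionally counted as correct when its overlap with a ground-truth box, measured as intersection over union (IoU), exceeds 50%. A minimal sketch, with boxes as (xmin, ymin, xmax, ymax) tuples:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, below the 0.5 threshold
```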

Participants may enter either (or both) of these competitions, and can choose to tackle any (or all) of the twenty object classes. The challenge allows for two approaches to each of the competitions:

  1. Participants may use systems built or trained using any methods or data excluding the provided test sets.
  2. Systems are to be built or trained using only the provided training/validation data.

The intention in the first case is to establish just what level of success can currently be achieved on these problems and by what method; in the second case the intention is to establish which method is most successful given a specified training set.

Segmentation Competition

  • Segmentation: Generating pixel-wise segmentations giving the class of the object visible at each pixel, or “background” otherwise.

Person Layout Taster Competition

  • Person Layout: Predicting the bounding box and label of each part of a person (head, hands, feet).

Action Classification Taster Competition

  • Action Classification: Predicting the action(s) being performed by a person in a still image (nine action classes).

ImageNet Large Scale Visual Recognition Taster Competition

The goal of this competition is to estimate the content of photographs for the purpose of retrieval and automatic annotation using a subset of the large hand-labeled ImageNet dataset (10,000,000 labeled images depicting 10,000+ object categories) as training. Test images will be presented with no initial annotation – no segmentation or labels – and algorithms will have to produce labelings specifying what objects are present in the images. In this initial version of the challenge, the goal is only to identify the main objects present in images, not to specify the location of objects.

Further details can be found at the ImageNet website.


To download the training/validation data, see the development kit.

The training data provided consists of a set of images; each image has an annotation file giving a bounding box and object class label for each object in one of the twenty classes present in the image. Note that multiple objects from multiple classes may be present in the same image. Some example images can be viewed online. A subset of images are also annotated with pixel-wise segmentation of each object present, to support the segmentation competition. Some segmentation examples can be viewed online.

Annotation was performed according to a set of guidelines distributed to all annotators.

The data will be made available in two stages; in the first stage, a development kit will be released consisting of training and validation data, plus evaluation software (written in MATLAB). One purpose of the validation set is to demonstrate how the evaluation software works ahead of the competition submission.

In the second stage, the test set will be made available for the actual competition. As in the VOC2008/VOC2009 challenges, no ground truth for the test data will be released.

The data has been split into 50% for training/validation and 50% for testing. The distributions of images and objects by class are approximately equal across the training/validation and test sets. In total there are 21,738 images. Further statistics are online.

Example images

Example images and the corresponding annotation for the classification/detection/segmentation tasks and the person layout taster can be viewed online.

Development Kit

The development kit consists of the training/validation data, MATLAB code for reading the annotation data, support files, and example implementations for each competition.

The development kit will be available according to the timetable.

Test Data

The test data is now available. Note that the only annotation in the data is for the layout/action taster competitions. As in 2008/2009, there are no current plans to release full annotation – evaluation of results will be provided by the organizers.

The test data can now be downloaded from the evaluation server. You can also use the evaluation server to evaluate your method on the test data.

Useful Software

Below is a list of software you may find useful, contributed by participants to previous challenges.


  • 3 May 2010: Development kit (training and validation data plus evaluation software) made available.
  • 31 May 2010: Test set made available.
  • 30 August 2010, 23:00 GMT. Deadline for submission of results. There will be no further extensions.
  • 11 September 2010: Workshop in association with ECCV 2010, Crete.

Submission of Results

Participants are expected to submit a single set of results per method employed. Participants who have investigated several algorithms may submit one result per method. Changes in algorithm parameters do not constitute a different method – all parameter tuning must be conducted using the training and validation data alone.

Results must be submitted using the automated evaluation server.

It is essential that your results files are in the correct format. Details of the required file formats for submitted results can be found in the development kit documentation. The results files should be collected in a single archive file (tar/tgz/tar.gz). An example of the correct format is available.

The format of your results must match that specified in the development kit and in the example file, including both file names and directory structure. If you are not entering a particular competition, just omit the corresponding files.

Participants submitting results for several different methods (noting the definition of different methods above) should produce a separate archive for each method.

In addition to the results files, participants will also need to specify:

  • contact details and affiliation
  • list of contributors
  • brief description of the method

If you would like to submit a more detailed description of your method, for example a relevant publication, this can be included in the results archive.

Best Practice

The VOC challenge encourages two types of participation: (i) methods which are trained using only the provided “trainval” (training + validation) data; (ii) methods built or trained using any data except the provided test data, for example commercial systems. In both cases the test data must be used strictly for reporting of results alone – it must not be used in any way to train or tune systems, for example by running multiple parameter choices and reporting the best results obtained.

If using the training data we provide as part of the challenge development kit, all development, e.g. feature selection and parameter tuning, must use the “trainval” (training + validation) set alone. One way is to divide the set into training and validation sets (as suggested in the development kit). Other schemes e.g. n-fold cross-validation are equally valid. The tuned algorithms should then be run only once on the test data.

In VOC2007 we made all annotations available (i.e. for training, validation and test data) but since then we have not made the test annotations available. Instead, results on the test data are submitted to an evaluation server.

Since algorithms should only be run once on the test data we strongly discourage multiple submissions to the server (and indeed the number of submissions for the same algorithm is strictly controlled), as the evaluation server should not be used for parameter tuning.

We encourage you to publish test results always on the latest release of the challenge, using the output of the evaluation server. If you wish to compare methods or design choices e.g. subsets of features, then there are two options: (i) use the entire VOC2007 data, where all annotations are available; (ii) report cross-validation results using the latest “trainval” set alone.

Publication Policy

The main mechanism for dissemination of the results will be the challenge webpage.

The detailed output of each submitted method will be published online e.g. per-image confidence for the classification task, and bounding boxes for the detection task. The intention is to assist others in the community in carrying out detailed analysis and comparison with their own methods. The published results will not be anonymous – by submitting results, participants are agreeing to have their results shared online.


If you make use of the VOC2010 data, please cite the following reference (to be prepared after the challenge workshop) in any publications:

@misc{pascal-voc-2010,
	author = "Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.",
	title = "The {PASCAL} {V}isual {O}bject {C}lasses {C}hallenge 2010 {(VOC2010)} {R}esults",
	howpublished = ""}

Database Rights

The VOC2010 data includes images obtained from the “flickr” website. Use of these images must respect the corresponding terms of use.

For the purposes of the challenge, the identity of the images in the database, e.g. source and name of owner, has been obscured. Details of the contributor of each image can be found in the annotation to be included in the final release of the data, after completion of the challenge. Any queries about the use or ownership of the data should be addressed to the organizers.


  • Mark Everingham (University of Leeds)
  • Luc van Gool (ETHZ, Zurich)
  • Chris Williams (University of Edinburgh)
  • John Winn (Microsoft Research Cambridge)
  • Andrew Zisserman (University of Oxford)


We gratefully acknowledge the following, who spent many long hours providing annotation for the VOC2010 database:

Yusuf Aytar, Jan Hendrik Becker, Patrick Buehler, Miha Drenik, Chris Engels, Ali Eslami, Adrien Gaidon, Sam Johnson, Jyri Kivinen, Lubor Ladicky, Markus Mathias, Alastair Moore, Glenn Sheasby, Paul Sturgess, David Tingdahl, Josiah Wang.

We also thank Alexander Sorokin for production and support of the annotation systems for Mechanical Turk, Ivan Laptev for development of the action classification task, and Marcin Eichner, Ali Eslami, Lubor Ladicky, Marcin Marszalek, Arpit Mittal and Andrea Vedaldi for testing of the development kit and evaluation server. We are also most grateful to Yusuf Aytar for further development and administration of the evaluation server.


The preparation and running of this challenge is supported by the EU-funded PASCAL2 Network of Excellence on Pattern Analysis, Statistical Modelling and Computational Learning.

We are grateful to Alyosha Efros for providing additional funding for annotation on Mechanical Turk.

History and Background

The main challenges have run each year since 2005. For more background on VOC, the following journal paper discusses some of the choices we made and our experience in running the challenge, and gives a more in depth discussion of the 2007 methods and results:

The PASCAL Visual Object Classes (VOC) Challenge
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A.
International Journal of Computer Vision, 88(2), 303-338, 2010
Bibtex source | Abstract | PDF

The table below gives a brief summary of the main stages of the VOC development.

Year Statistics New developments Notes
2005 Only 4 classes: bicycles, cars, motorbikes, people. Train/validation/test: 1578 images containing 2209 annotated objects. Two competitions: classification and detection Images were largely taken from existing public datasets, and were not as challenging as the flickr images subsequently used. This dataset is obsolete.
2006 10 classes: bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep. Train/validation/test: 2618 images containing 4754 annotated objects. Images from flickr and from Microsoft Research Cambridge (MSRC) dataset The MSRC images were easier than flickr as the photos often concentrated on the object of interest. This dataset is obsolete.
2007 20 classes:

  • Person: person
  • Animal: bird, cat, cow, dog, horse, sheep
  • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
  • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

Train/validation/test: 9,963 images containing 24,640 annotated objects.

  • Number of classes increased from 10 to 20
  • Segmentation taster introduced
  • Person layout taster introduced
  • Truncation flag added to annotations
  • Evaluation measure for the classification challenge changed to Average Precision. Previously it had been ROC-AUC.
This year established the 20 classes, and these have been fixed since then. This was the final year that annotation was released for the testing data.
2008 20 classes. The data is split (as usual) around 50% train/val and 50% test. The train/val data has 4,340 images containing 10,363 annotated objects.
  • Occlusion flag added to annotations.
  • Test data annotation no longer made public.
  • The segmentation and person layout data sets include images from the corresponding VOC2007 sets.
2009 20 classes. The train/val data has 7,054 images containing 17,218 ROI annotated objects and 3,211 segmentations.
  • From now on the data for all tasks consists of the previous years’ images augmented with new images. In earlier years an entirely new data set was released each year for the classification/detection tasks.
  • Augmenting allows the number of images to grow each year, and means that test results can be compared on the previous years’ images.
  • Segmentation becomes a standard challenge (promoted from a taster)
  • No difficult flags were provided for the additional images (an omission).
  • Test data annotation not made public.
2010 20 classes. The train/val data has 10,103 images containing 23,374 ROI annotated objects and 4,203 segmentations.
  • Action Classification taster introduced.
  • Associated challenge on large scale classification introduced based on ImageNet.
  • Amazon Mechanical Turk used for early stages of the annotation.
  • Method of computing AP changed. Now uses all data points rather than TREC style sampling.
  • Test data annotation not made public.
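The change in the AP computation noted for 2010 can be illustrated with a sketch of both variants: the TREC-style measure averages interpolated precision at eleven fixed recall levels, while the all-points measure integrates precision over every change in recall (illustrative code, omitting the precision-envelope interpolation the official MATLAB devkit applies):

```python
def pr_curve(scores, labels):
    """Precision/recall after each detection, ranked by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    npos = sum(labels)
    precisions, recalls = [], []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / npos)
    return precisions, recalls

def ap_11point(precisions, recalls):
    """TREC-style AP: mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    return sum(max([p for p, r in zip(precisions, recalls) if r >= t] or [0.0])
               for t in [i / 10 for i in range(11)]) / 11

def ap_all_points(precisions, recalls):
    """VOC2010-style AP: precision accumulated over every recall increment."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

The two measures agree on a perfect ranking but can differ noticeably on curves whose precision changes between the eleven sampled recall levels.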


The goal is to select/design a classifier (and any pre-processing systems, including a feature extractor) that correctly classifies EEG data into one of two classes. The winner will be the submission that maximizes the area under the ROC curve.
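For reference, the area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting one half. A minimal sketch of this rank-based computation:

```python
def roc_auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative),
    ties counted as one half (the Wilcoxon/Mann-Whitney statistic)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0: perfect ranking
```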


Anyone who has an interest in machine learning and who has access to Matlab®.


Registration is not required. However, if you wish to receive important updates on the competition by email, please send a request to the organizers.

Training Data

The training data may be downloaded from mlsp2010TrainingData.mat (44 MB). The training data, which is in Matlab® format, consist of EEG data collected while a subject viewed satellite images that were displayed in the center of an LCD monitor approximately 43 cm in front of them.

There are 64 channels of EEG data. The total number of samples is 176378. The sampling rate is 256 Hz. There are 75 blocks and 2775 total satellite images. Each block contains a total of 37 satellite images, each of which measures 500 x 500 pixels. All images within a block are displayed for 100 ms and each image is displayed as soon as the preceding image is finished. Each block is initiated by the subject after a rest period, the length of which was not specified in advance.

The subject was instructed to fixate on the center of the images and to press the space bar whenever they detected an instance of a target, where the targets are surface-to-air missile (SAM) sites. The subject also needed to press the space bar to initiate a new block and to clear feedback information that was displayed after each block.

We expect a particular neural signature, the P300, to occur whenever the subject detects an instance of a target in one of the satellite images. The P300 gets its name from the fact that detection of a (rare, pre-defined) target causes a deflection of the EEG signal approximately 300 ms after the visual stimulus is presented to the subject. The P300 is most prominent in the midline channels (e.g., Pz, Cz, Fz; see cap_64_layout_medium.jpg for the layout of the channels). In addition, there is a separate neural signature associated with the pressing of the space bar. For our data, this second neural signal occurs around 500-600 ms after the stimulus containing the target is presented.
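As an illustration of how such signatures might be located, one could cut the continuous recording into per-image epochs around each stimulus onset (a sketch only; eegData and imageTrigger follow the variable descriptions in this document, while the window length is an arbitrary choice):

```python
import numpy as np

def extract_epochs(eegData, imageTrigger, fs=256, window_s=0.6):
    """eegData: (channels, samples) array; imageTrigger: (samples,) vector
    with nonzero values at image onsets. Returns an array of shape
    (n_images, channels, window_samples) covering 0 to window_s seconds
    after each onset, long enough to span a P300 around 300 ms."""
    win = int(window_s * fs)
    onsets = np.flatnonzero(imageTrigger)
    epochs = [eegData[:, t:t + win] for t in onsets
              if t + win <= eegData.shape[1]]  # drop truncated epochs
    return np.stack(epochs)
```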

The variables in the training data are defined as follows:

Test Data

The test data are not made available to the participants. Instead, participants will submit Matlab® code, which the competition chairs will test by passing the test data to the submitted code.

Like before, there are 64 channels of EEG data, the sampling rate is 256 Hz, there is no delay between images within the same block, the subject rests between blocks for as long as they wish, and the subject pressed the space bar to signify that they detected a target, to initiate a new block, and to clear feedback information that was displayed after each block.

Unlike before, the test data consist of 890 blocks and 9891 satellite images, the total number of samples of EEG data is 1603334, every other image within a block is a mask image (mask images do not contain targets), the buttonTrigger variable is not available, and the imageTrigger variable takes only values of 0 or 1, where a 1 corresponds to the onset of each prospective target image (i.e., the satellite images) and 0 is used elsewhere. Another difference is that 4 different image durations are used in the test data. The image durations, which apply to both satellite and mask images, are (approximately) 50 ms, 100 ms, 150 ms, and 200 ms. All images within a given block have the same image duration and all blocks having a specified image duration are grouped together. Each block contains 22, 10, 7, and 5 prospective target images when the image duration is 50 ms, 100 ms, 150 ms, and 200 ms, respectively. Keep in mind that the time difference between successive prospective target images is twice the corresponding image duration due to the presence of the mask images. Hence, successive prospective target images within a block appear every (approximately) 100 ms, 200 ms, 300 ms, or 400 ms.


A successful submission consists of (1) the names of the team members (each person may belong to at most two teams and each team is allowed a single submission), (2) the name(s) of the host institutions of the researchers, (3) a 1-3 paragraph description of the approach used, and (4) Matlab® code, myFunction.m, which must conform to the requirements below.

The competition chairs will call the submitted code using,

>> [out] = myFunction(eegData,t,imageTrigger,eegLabel,eegCoord,thr);

where the first 5 input variables correspond to the test data, out is a (9891 x 1) vector consisting of 0’s and 1’s (each 0 corresponds to a predicted non-target and each 1 corresponds to a predicted target) and thr is a user-defined threshold that biases the predicted class (0 < thr < 1). When thr = 0, the code should predict that all prospective target images are non-targets and, when thr = 1, the code should predict that all prospective target images are targets. The thr variable will be used to construct an ROC curve. Example code that demonstrates the use of thr can be downloaded from myFunction.m. The code should not write to any drive, it must finish running in a reasonable time (on the order of minutes), and must consist only of regular, uncompiled Matlab® code (P-code, mex files, and compiled code are, e.g., forbidden). Parameter values for the trained classifier can be hardcoded into myFunction.m. Alternatively, entrants can read parameter values into myFunction.m from a single user-supplied *.mat file.
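The role of thr can be illustrated with a Python sketch of how the organizers might sweep it to trace out an ROC curve (the actual harness is MATLAB; classify_images and the per-image scores below are hypothetical stand-ins for myFunction and its internals):

```python
def sweep_thresholds(classify_images, thresholds):
    """Call the classifier once per thr value and collect its 0/1
    predictions. classify_images(thr) should be monotone in thr:
    thr = 0 predicts all non-targets, thr = 1 predicts all targets."""
    return {thr: classify_images(thr) for thr in thresholds}

# Toy stand-in: compare fixed per-image scores against 1 - thr.
scores = [0.1, 0.4, 0.6, 0.9]
toy = lambda thr: [int(s >= 1.0 - thr) for s in scores]

curves = sweep_thresholds(toy, [0.0, 0.5, 1.0])
print(curves[0.0], curves[1.0])  # [0, 0, 0, 0] [1, 1, 1, 1]
```

Each threshold yields one (false positive rate, true positive rate) point once the predictions are compared with ground truth, and the points swept over thr form the ROC curve whose area is scored.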


Prior to submitting, we recommend testing the code on the training data (after performing the following: “f=find(imageTrigger==2); imageTrigger(f)=1;”).


Submissions must be emailed to the competition chairs no later than April 8, 2010.


The MLSP 2010 proceedings will include a publication, written by the competition chairs, which describes the competition and shows the comparison of the submitted methods. The committee chairs will invite up to three teams to submit a two-page summary that discusses the method they used in the competition (the mathematical notation must be consistent with the paper authored by the competition chairs). Pending approval, the invited summaries will be published in the conference proceedings. The process of selecting teams to which invitations to publish will be sent is distinct from selecting the winner of the competition. Invitations to publish will be based on both (1) utilization of machine learning principles and (2) performance in terms of area under ROC curve.


Up to two N900 high-performance mobile computers from Nokia will be awarded as prizes. The 2010 MLSP Competition Committee will distribute one award to each team it selects, where the selection is based on: (1) the performance of the submitted methods and (2) the requirement that at least one member of each selected team attend the 2010 MLSP Conference. Members of the 2010 MLSP Competition Committee (and everyone belonging to any of the labs of the 2010 MLSP Competition Committee) are not eligible for this award.

In addition to the Nokia mobile devices, the MLSP Competition Committee will award a maximum of 1500 Euros to the selected winning team(s). The monetary award(s), provided by the PASCAL2 Challenge Program, are intended to defray some of the expenses related to traveling to and attending the MLSP 2010 Conference. More details will be made available at a later date.

