This challenge introduces a new kind of task that combines a new modality, eye movements, with user modeling. The task is to infer the interests or state of a user from the time series of her eye movements. Application areas include attentive user interfaces and proactive (i.e. anticipatory) information retrieval.

The data comes from a research project which tries to complement or replace the laborious explicit relevance feedback in information retrieval with implicit feedback. The implicit feedback is derived from eye movements, which contain rich information about the attention and interest patterns of the user. The practical problem is that the signal is very noisy and the correspondence between eye fixation patterns and the user’s attention is often ambiguous.

In order to apply machine learning methods we collected a set of eye movement patterns in a controlled experimental setting where the relevance was known. The user was asked a question and then shown a set of document titles on a display. Some of the titles were known to be relevant (class “R”), one of them contained the answer to the question (class “C”), and the rest were irrelevant (class “I”). Eye movements were measured while the subjects were reading the titles.

We have so far established that relevance can be predicted to a certain degree. There is, however, ample room for improvement on the first feasibility studies. Which method works best remains an open question.

The Challenge

The challenge will consist of two parts: The first is directly relevant to much of current PASCAL research, and the second goes deeper into the eye movement signal.

In the first part, “standard” feature extraction has already been carried out, and the data for each title to be classified is a time series of feature vectors. The feature set contains 21 features commonly used in psychological studies of reading, calculated for each word on the display. The task is to classify the sequence into one of the three classes, “R”, “C” or “I”.

In the second, optional part, the data is the unprocessed trajectory of eye movements, that is, a sequence of two-dimensional coordinates telling where the user’s gaze was pointed. The data set will be complemented with the exact locations of the words of the titles on the screen. The task will be the same: to classify the trajectory into one of the three classes. This gives modelers the chance to improve on the traditional psychological feature selection.

The data set for the challenge will be measured in experiments with about 10 subjects, each reading 50 sets of titles.

The data sets will be available March 1st, 2005. The results of the competition will be evaluated by the best classification accuracy obtained using an unlabeled test dataset (a subset of data from the same subjects). The participants will submit the predicted classes of the test data titles to the organizers, who will then check the classification accuracy. An on-line validation data set will be available during the challenge.
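
As an illustration of the evaluation described above, the following minimal sketch computes classification accuracy from a list of submitted class labels. The function name and the example labels are hypothetical; the organizers' actual scoring procedure and file format are not specified here.

```python
def classification_accuracy(predicted, actual):
    """Fraction of titles whose predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels for five test titles (not real challenge data).
true_labels = ["I", "R", "C", "I", "R"]
submitted = ["I", "R", "I", "I", "C"]
accuracy = classification_accuracy(submitted, true_labels)
print(f"accuracy = {accuracy:.2f}")
```

Because the three classes are unlikely to be equally frequent (only one title per question contains the answer), a baseline that always predicts the majority class is a useful reference point when reading such a score.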


Goals of the organizers

The goal of the “BCI Competition III” is to validate signal processing and classification methods for Brain-Computer Interfaces (BCIs). Compared to the past BCI Competitions, new challenging problems are addressed that are highly relevant for practical BCI systems.

This BCI Competition also includes, for the first time, ECoG data (data set I) and one data set for which preprocessed features are provided (data set V), for competitors who prefer to focus on the classification task rather than dive into the depths of EEG analysis.
The organizers are aware that such a competition cannot validate BCI systems as a whole. Nevertheless, we envision interesting contributions that will ultimately improve the full BCI.

Goals for the participants

For each data set specific goals are given in the respective description. Technically speaking, each data set consists of single-trials of spontaneous brain activity, one part labeled (training data) and another part unlabeled (test data), and a performance measure. The goal is to infer labels (or their probabilities) for the test set from training data that maximize the performance measure for the true (but to the competitors unknown) test labels. Results will be announced at the Third International BCI Meeting in Rensselaerville, June 14-19, and on this web site. For each data set, the competition winner gets a chance to publish the algorithm in an article devoted to the competition that will appear in IEEE Transactions on Neural Systems and Rehabilitation Engineering.


Results of the BCI Competition III are available here.

BCI Competition III is closed for submissions.

Description of Data Set I in ASCII format corrected

In the description of Data Set I in ASCII format (on the download web page) rows and columns were confused. The description is updated now.

Channel Labels in Preprocessed version of Data Set V in Matlab format corrected

In the Matlab format of Data Set V, the field clab of the variable nfo holds the channel labels. In the preprocessed version of this data set there are only 8 (spatially filtered) channels of the original 32. Erroneously, in the file with the preprocessed data nfo.clab contained all 32 channels, instead of the 8 channel subset. This is corrected now.

Restriction of the test data in data set IIIb

Please see additional information on data set IIIb.

Description of data set IVc updated

In the description of data set IVc it was said that there are 280 test trials. This was wrong: the test set contains 420 trials, so the submission file for data set IVc must contain 420 lines of classifier output. The description is corrected now.

Submissions to Data Set IIIa and IIIb

Due to the large size of files for submissions to data set IIIa and IIIb, the files should not be sent by email, but put on the ftp server of TU-Graz: please log in by ftp to as user ftp with password ftp, go to the directory /incoming/bci2005/submissions/IIIa/ resp. /incoming/bci2005/submissions/IIIb/ and put your file there. Be sure to have transfer mode set to binary. If you have problems, please contact ⟨⟩.

Clarification of Rules for Data Set V

The description of data set V was updated in order to clarify the requirement ‘The algorithm should provide an output every 0.5 seconds using the last second of data.’, see description V.
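
The timing requirement quoted above can be sketched as a sliding window over the sample stream. This is only an illustration: it assumes the 512 Hz sampling rate listed for data set V, assumes that the first output is produced once a full second of data is available, and uses a hypothetical function name.

```python
def sliding_windows(n_samples, sampling_rate_hz=512, window_s=1.0, step_s=0.5):
    """Yield (start, stop) sample indices of each analysis window; stop is exclusive."""
    window = int(round(window_s * sampling_rate_hz))  # samples in the last second
    step = int(round(step_s * sampling_rate_hz))      # samples between two outputs
    stop = window
    while stop <= n_samples:
        yield stop - window, stop
        stop += step


# Three seconds of signal give one output at 1.0 s and then one every 0.5 s.
windows = list(sliding_windows(n_samples=3 * 512))
for start, stop in windows:
    print(f"output at sample {stop} uses samples [{start}, {stop})")
```

Consecutive windows overlap by half a second, so each sample (apart from those at the edges) contributes to two classifier outputs.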


Data sets

Data set I: ‹motor imagery in ECoG recordings, session-to-session transfer› (description I)
provided by Eberhard-Karls-Universität Tübingen, Germany, Dept. of Computer Engineering and Dept. of Medical Psychology and Behavioral Neurobiology (Niels Birbaumer), and
Max-Planck-Institute for Biological Cybernetics, Tübingen, Germany (Bernhard Schölkopf), and
Universität Bonn, Germany, Dept. of Epileptology
cued motor imagery (left pinky, tongue) from one subject; training and test data are ECoG recordings from two different sessions with about one week in between
[2 classes, 64 ECoG channels (0.016-300Hz), 1000Hz sampling rate, 278 training and 100 test trials]

Data set II: ‹P300 speller paradigm› (description II)
provided by Wadsworth Center, NYS Department of Health (Jonathan R. Wolpaw, Gerwin Schalk, Dean Krusienski)
the goal is to estimate which letter of a 6-by-6 matrix with successively intensified rows resp. columns the subject was paying attention to; data from 2 subjects
[36 classes, 64 EEG channels (0.1-60Hz), 240Hz sampling rate, 85 training and 100 test trials, recorded with the BCI2000 system]

Data sets IIIa: ‹motor imagery, multi-class› (description IIIa)
provided by the Laboratory of Brain-Computer Interfaces (BCI-Lab), Graz University of Technology, (Gert Pfurtscheller, Alois Schlögl)
cued motor imagery with 4 classes (left hand, right hand, foot, tongue) from 3 subjects (ranging from quite good to fair performance); performance measure: kappa-coefficient
[4 classes, 60 EEG channels (1-50Hz), 250Hz sampling rate, 60 trials per class]

Data sets IIIb: ‹motor imagery with non-stationarity problem› (description IIIb, additional information)
provided by TU-Graz (as above)
cued motor imagery with online feedback (non-stationary classifier) with 2 classes (left hand, right hand) from 3 subjects; performance measure: mutual information
[2 classes, 2 bipolar EEG channels 0.5-30Hz, 125Hz sampling rate, 60 trials per class]

Data set IVa: ‹motor imagery, small training sets› (description IVa)
provided by the Berlin BCI group: Fraunhofer FIRST, Intelligent Data Analysis Group (Klaus-Robert Müller, Benjamin Blankertz), and Campus Benjamin Franklin of the Charité – University Medicine Berlin, Department of Neurology, Neurophysics Group (Gabriel Curio)
cued motor imagery with 2 classes (right hand, foot) from 5 subjects; from 2 subjects most trials are labelled (resp. 80% and 60%), while from the other 3 less and less training data are given (resp. 30%, 20% and 10%); the challenge is to make a good classification even from little training data, thereby maybe using information from other subjects with many labelled trials.
[2 classes, 118 EEG channels (0.05-200Hz), 1000Hz sampling rate, 280 trials per subject]

Data set IVb: ‹motor imagery, uncued classifier application› (description IVb)
provided by the Berlin BCI group (see above)
training data is cued motor imagery with 2 classes (left hand, foot) from 1 subject, while test data is continuous (i.e., non-epoched) EEG; the challenge is to provide classifier outputs for each time point, although it is unknown to the competitors at what time points mental states changed; performance measure: mutual information with true labels (-1: left hand, 1: foot, 0: rest) averaged over all samples
[2 classes, 118 EEG channels (0.05-200Hz), 1000Hz sampling rate, 210 training trials, 12 minutes of continuous EEG for testing]

Data set IVc: ‹motor imagery, time-invariance problem› (description IVc)
provided by the Berlin BCI group (see above)
cued motor imagery with 2 classes (left hand, foot) from 1 subject (training data is the same as for data set IVb); test data was recorded 4 hours after the training data and contains an additional class ‘relax’; performance measure: mutual information with true labels (-1: left hand, 1: foot, 0: relax) averaged over all trials
[2 classes, 118 EEG channels (0.05-200Hz), 1000Hz sampling rate, 210 training trials, 420 test trials]

Data set V: ‹mental imagery, multi-class› (description V)
provided by IDIAP Research Institute (José del R. Millán)
cued mental imagery with 3 classes (left hand, right hand, word association) from 3 subjects; besides the raw signals also precomputed features are provided
[3 classes, 32 EEG channels (DC-256Hz), 512Hz sampling rate, continuous EEG and precomputed features]
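
Data set IIIa above names the kappa coefficient as its performance measure. The sketch below assumes this means Cohen's kappa, computed as (observed agreement - chance agreement) / (1 - chance agreement); the exact variant used by the organizers is given in the data set description, and the trial labels here are made up.

```python
from collections import Counter


def cohen_kappa(true_labels, predicted_labels):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(true_labels)
    if n == 0 or n != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of trials classified correctly.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    # Chance agreement: product of marginal class frequencies, summed over classes.
    true_counts = Counter(true_labels)
    predicted_counts = Counter(predicted_labels)
    expected = sum(true_counts[c] * predicted_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical 4-class example: 6 of 8 trials are correct.
truth = ["left", "left", "right", "right", "foot", "foot", "tongue", "tongue"]
guess = ["left", "left", "right", "foot", "foot", "foot", "tongue", "left"]
kappa = cohen_kappa(truth, guess)
print(f"kappa = {kappa:.3f}")
```

With four balanced classes, chance agreement is 0.25, so an accuracy of 0.75 maps to a kappa of about 0.667, while guessing at chance level maps to a kappa near 0.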


December 12th 2004: launching of the competition
May 22nd 2005, midnight CET to May 23rd: deadline for submissions
June 16th 2005 (approx.): announcement of the results on this web site


Submissions to a data set are to be sent to the responsible contact person as stated in the data set description. The submission has to comprise the estimated labels, names and affiliations of all involved researchers and a short note on the involved processing techniques. We send confirmations for each submission we get. If you do not receive a confirmation within 2 days please resend your email and inform other organizing committee members, e.g., ⟨⟩, ⟨⟩, ⟨⟩, ⟨⟩, ⟨⟩
One researcher may NOT submit multiple results to one data set. She/he has to decide for her/his favorite one. However: From one research group multiple submissions to one data set are possible. The sets of involved researchers do not have to be disjoint, but (1) the ‘first author’ (contributor) should be distinct, and (2) approaches should be substantially different.
For details on how to submit your results please refer to the description of the respective data set. If questions remain unanswered, send an email to the responsible contact person for the specific data set, who is indicated in the description.
Submissions are evaluated for each data set separately. There is no need to submit for all data sets in order to participate.
Each participant agrees to deliver an extended description (1-2 pages) of the algorithm used, for publication, by July 31st 2005 in case she/he is the winner for one of the data sets.


Albany: Gerwin Schalk, Dean Krusienski, Jonathan R. Wolpaw

Berlin: Benjamin Blankertz, Guido Dornhege, Klaus-Robert Müller

Graz: Alois Schlögl, Bernhard Graimann, Gert Pfurtscheller

Martigny: Silvia Chiappa, José del R. Millán

Tübingen: Michael Schröder, Thilo Hinterberger, Thomas Navin Lal, Guido Widman, Niels Birbaumer


References to papers about past BCI Competitions.

  • Benjamin Blankertz, Klaus-Robert Müller, Gabriel Curio, Theresa M. Vaughan, Gerwin Schalk, Jonathan R. Wolpaw, Alois Schlögl, Christa Neuper, Gert Pfurtscheller, Thilo Hinterberger, Michael Schröder, and Niels Birbaumer. The BCI competition 2003: Progress and perspectives in detection and discrimination of EEG single trials. IEEE Trans. Biomed. Eng., 51(6):1044-1051, 2004.
  • The issue IEEE Trans. Biomed. Eng., 51(6) contains also articles of all winning teams of the BCI Competition 2003.
  • Paul Sajda, Adam Gerson, Klaus-Robert Müller, Benjamin Blankertz, and Lucas Parra. A data analysis competition to evaluate machine learning algorithms for use in brain-computer interfaces. IEEE Trans. Neural Sys. Rehab. Eng., 11(2):184-185, 2003.

References to BCI Overview papers.

  • Eleanor A. Curran and Maria J. Stokes. Learning to control brain activity: A review of the production and control of EEG components for driving brain-computer interface (BCI) systems. Brain Cogn., 51:326-336, 2003.
  • Jonathan R. Wolpaw, Niels Birbaumer, Dennis J. McFarland, Gert Pfurtscheller, and Theresa M. Vaughan. Brain-computer interfaces for communication and control. Clin. Neurophysiol., 113:767-791, 2002.
  • José del R. Millán. Brain-computer interfaces. In M.A. Arbib (ed.), “Handbook of Brain Theory and Neural Networks, 2nd ed.” Cambridge: MIT Press, 2002.
  • Andrea Kübler, Boris Kotchoubey, Jochen Kaiser, Jonathan Wolpaw, and Niels Birbaumer. Brain-computer communication: Unlocking the locked in. Psychol. Bull., 127(3):358-375, 2001.

References to BCI Special Issues.

  • IEEE Trans. Biomed. Eng., 51(6), 2004.
  • IEEE Trans. Neural Sys. Rehab. Eng., 11(2), 2003.
  • IEEE Trans. Rehab. Eng., 8(2), 2000.


Some ideas from the Workshop:

  • Defining good losses for probabilistic predictions is hard, since the losses might encourage strategies that are loss-dependent. Ideally one would want people to give an “honest” predictive distribution. Maybe one way of encouraging this would be to apply several losses that have contradictory properties. Another way could be not to reveal the loss under which predictions will be evaluated.
  • Datasets and losses should not be chosen separately, since some losses are inappropriate for evaluating performance on certain problems. An example of this is using the log loss for regression in cases where it is expected to observe exactly the same target more than once. In such a case one could “cheat” by placing enormous amounts of density (with constant mass) on the target value in question.
  • Neural Networks are far from dead, and Kernel methods aren’t the best. The best performing methods were not Kernel methods, and Neural Nets did a very good job. Averaging, in a Bayesian way or by means of ensemble methods, did seem to give very good results. For example, ensembles of decision trees perform very well, and so did Bayesian Neural Nets.
  • Though Bayesian methods did well, they seemed not to be the only competitive method. Indeed, non-Bayesian approaches, like regression on the variance, did also perform very well.
  • Datamining is probably as important as Machine Learning …
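
The log-loss loophole described in the second point above can be illustrated numerically. The sketch below is an assumption-laden toy, not code from the challenge: the predictive density is a two-component Gaussian mixture in which a fixed mass (here 0.1) sits in a narrow spike on a target value known to recur, and all names and numbers are hypothetical.

```python
import math


def gaussian_pdf(y, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def spike_log_loss(target, spike_at, spike_std, spike_mass=0.1,
                   broad_mean=0.0, broad_std=1.0):
    """Negative log density of a broad Gaussian plus a narrow spike of fixed mass."""
    density = (spike_mass * gaussian_pdf(target, spike_at, spike_std)
               + (1.0 - spike_mass) * gaussian_pdf(target, broad_mean, broad_std))
    return -math.log(density)


# The spike keeps the same probability mass, but its width shrinks.
widths = (1e-1, 1e-3, 1e-6)
losses = [spike_log_loss(target=2.0, spike_at=2.0, spike_std=w) for w in widths]
for width, loss in zip(widths, losses):
    print(f"spike width {width:g}: log loss at repeated target = {loss:.2f}")
```

The loss at the repeated target decreases without bound as the spike narrows, while the cost on every other target stays bounded because the broad component keeps mass 0.9. This is why the log loss is a poor match for data sets in which exact target values recur.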


The Challenge is Online


  • Please visit the new challenge webpage (link above), where the most up-to-date information is posted
  • The Challenge will be presented at the NIPS 2004 Workshop on Calibration and Probabilistic Prediction in Machine Learning
  • The Challenge will then be presented again at the April 2005 PASCAL Challenges Workshop in Southampton

Development Kit

The VOC challenge 2005 has now ended. The development kit which was provided to participants is available for download. This includes code for:

  • Loading PASCAL image sets and annotation
  • Computing Receiver Operating Characteristic (ROC) curves for classification
  • Computing Precision/Recall curves for detection
  • Computing Detection Error Tradeoff (DET) curves for detection
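
For readers unfamiliar with the curves listed above, here is a minimal Python sketch of how an ROC curve and its area can be computed from classifier confidences. It is not the development kit's MATLAB code; the function names and the example scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per distinct score."""
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical confidences that an image contains the class, with true presence flags.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve)
print(f"AUC = {auc:.3f}")
```

A DET curve plots the same information on different axes (miss rate, which is 1 minus the true positive rate, against the false positive rate), so it can be derived from these points.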


The two datasets provided for the challenge have been added to the main PASCAL image databases page. To run the challenge code you will need to download these two databases:


Results of the challenge were presented at the PASCAL Challenges workshop in April 2005, Southampton, UK. A chapter reporting results of the challenge will appear in Lecture Notes in Artificial Intelligence:

  • “The 2005 PASCAL Visual Object Classes Challenge” (3.7MB PDF)
    M. Everingham, A. Zisserman, C. Williams, L. Van Gool, M. Allan, C. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorko, S. Duffner, J. Eichhorn, J. Farquhar, M. Fritz, C. Garcia, T. Griffiths, F. Jurie, D. Keysers, M. Koskela, J. Laaksonen, D. Larlus, B. Leibe, H. Meng, H. Ney, B. Schiele, C. Schmid, E. Seemann, J. Shawe-Taylor, A. Storkey, S. Szedmak, B. Triggs, I. Ulusoy, V. Viitaniemi, and J. Zhang.
    In Selected Proceedings of the First PASCAL Challenges Workshop, LNAI, Springer-Verlag, 2006 (in press).

An earlier report which includes some more detailed results is also available for download, and Powerpoint slides of the challenge workshop presentation:


The goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images will be provided. The four object classes that have been selected are:

  • motorbikes
  • bicycles
  • people
  • cars

There will be two main competitions:

  1. For each of the 4 classes, predicting presence/absence of an example of that class in the test image.
  2. Predicting the bounding box and label of each object from the 4 target classes in the test image.

Contestants may enter either (or both) of these competitions, and can choose to tackle any (or all) of the four object classes. The challenge allows for two approaches to each of the competitions:

  1. Contestants may use systems built or trained using any methods or data excluding the provided test sets.
  2. Systems are to be built or trained using only the provided training data.

The intention in the first case is to establish just what level of success can currently be achieved on these problems and by what method; in the second case the intention is to establish which method is most successful given a specified training set.


The training data provided will consist of a set of images; each image has an annotation file giving a bounding box and object class label for each object in one of the four classes present in the image. Note that multiple objects from multiple classes may be present in the same image.

The data will be made available in two stages; in the first stage, a development kit will be released consisting of training and validation data, plus evaluation software (written in MATLAB). One purpose of the validation set is to demonstrate how the evaluation software works ahead of the competition submission.

In the second stage, two test sets will be made available for the actual competition:

  • In the first test set, images will be taken from the same distribution as the training data, expected to make an ‘easier’ challenge.
  • The second test set will be freshly collected to provide images not expected to come from the same distribution as the training data and should make a ‘harder’ challenge.

Contestants are free to submit results for any (or all) of the test sets provided.

Submission of results

Contestants may run several experiments on each competition of the challenge, for example using alternative methods or different training data. Contestants must assess their results using the software provided. This software writes standardized output files recording classifier output, ROC and precision/recall curves. For submission, contestants must prepare:

  • A set of output files from the provided software for each experiment
  • A specification of the experiments reported i.e. the meaning of each output file
  • A brief description of the method used in each experiment. Entrants may withhold details of their method, but will be judged in a separate category of the competition.

To submit your results, please prepare a single archive file (gzipped tar/zip) and place it on a publicly accessible web/ftp server. The contents should be as listed above. See the development kit documentation for information on the VOCroc and VOCpr functions needed to generate output files, and the location of these files. When you have prepared your archive and checked that it is accessible, send an email with the URL and any necessary explanatory notes to Mark Everingham.

Participants who cannot place their results on the web/ftp may instead send them by email as an attachment. Please include details of the attachment in the email body. Please do not send large files (>200KB) in this way.


  • 21 Feb 2005 : Development kit (training and validation data plus evaluation software) made available.
  • 14 March 2005: Test set made available
  • 31 March 2005: DEADLINE for submission of results
  • 4 April 2005: Results announced
  • 11 April 2005: Half-day (afternoon) challenge workshop held in Southampton, UK.

To aid administration of the challenge, entrants will be required to register when downloading the test set.

Publication policy

The main mechanism for dissemination of the results will be the challenge webpage. It is also possible that an overview paper of the results will be produced.


  • Luc van Gool (Zurich)
  • Chris Williams (Edinburgh)
  • Andrew Zisserman (Oxford)

Technical contributors

  • Mark Everingham (Oxford): VOC challenge [email]
  • Manik Varma (Oxford): PASCAL image databases

Efficient approximate inference in large Hybrid Networks (graphical models with discrete and continuous variables) is one of the major unsolved problems in machine learning, and insight into good solutions would be beneficial in advancing the application of sophisticated machine learning to a wide range of real-world problems.

Such research would potentially benefit applications in Speech Recognition, Visual Object Tracking and Machine Vision, Robotics, Music Scene Analysis, analysis of complex time series, understanding and modelling of complex computer networks, Condition Monitoring, and other complex phenomena.

This theory challenge specifically addresses a central component area of PASCAL, namely Bayesian Statistics and statistical modelling, and is also related to the other central areas of Computational Learning, Statistical Physics and Optimisation techniques.
One aim of this challenge is to bring together leading researchers in graphical models and related areas to develop and improve on existing methods for tackling the fundamental intractability in HNs. We do not believe that there will necessarily emerge a single best approach, although we would expect that successes in one application area should be transferable to related areas. Many leading machine learning researchers are currently working on applications that involve HNs, and we invite participants to suggest their own applications. Ideally, this would be in the form of a dataset along the lines of PASCAL.

Graphical Models
Graphical models are a powerful framework for formulating and solving difficult problems in machine learning. Being essentially a marriage between graph and probability theory, they provide a theoretically elegant approach to thinking about complex machine learning problems, and have had widespread success, being now one of the dominant approaches. A great many problems in machine learning can be formulated as latent or hidden variable problems in which an underlying mechanism, which cannot be directly observed, is responsible for generating observable values — based on these observations, we wish to learn something about the fundamental generating process.
Hybrid Networks
Hybrid networks are graphical models which contain both discrete and continuous variables. Often, but not exclusively, natural application areas arise in the temporal domain, for example speech recognition, or visual tracking, in which time plays a fundamental role; one reason for this is that many physical processes are inherently Markovian.
In the continuous case, a widely used model is the Kalman Filter (KF), which is based on a linear dynamical system with Gaussian additive noise. The discrete analogue of the Kalman Filter is the Hidden Markov Model (HMM), in which hidden states are discrete, and the output (visible) states may be discrete or continuous. Such models are used in leading speech recognition software and tracking applications. Both the KF and the HMM are computationally tractable. Recently, there has been a recognition that many machine learning models would naturally have both continuous (as in the KF) and discrete (as in the HMM) hidden variables.

Hybrid Networks (HNs) are such Graphical Models with both continuous and discrete variables. For example, imagine that a musical instrument is played at a time t, which we can model with a switch variable s(t)=1: future sound generation can be modelled as a hidden Gaussian linear dynamical system with transition dynamics p(h(t+1)|h(t),s(t)=1). When the musical instrument is turned off at a future time t, s(t)=0, a different dynamics occurs, p(h(t+1)|h(t),s(t)=0). The observation (visible) process p(v(t)|h(t)) is typically a noisy projection of the hidden state h(t), say, in the case of sound, to a one-dimensional pressure displacement. Based on the observed sequence v(1),…,v(T), we may wish to infer the switch variables s(1),…,s(T). This particular kind of HN is called a switching Kalman Filter.
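
To make the generative process concrete, the following toy sketch samples from a switching linear dynamical system with a scalar hidden state. Everything numeric in it is an assumption made for illustration: the Markov prior on the switch, the transition coefficients, and the noise levels are not taken from any challenge data.

```python
import random


def simulate_switching_lds(n_steps, seed=0):
    """Sample switches s(t), hidden states h(t) and observations v(t)."""
    rng = random.Random(seed)
    stay_prob = 0.95                 # assumed P(s(t+1) = s(t)), a Markov switch prior
    coeff = {1: 0.99, 0: 0.80}       # assumed transition coefficient for each switch state
    noise_std = {1: 0.50, 0: 0.05}   # assumed process noise for each switch state
    obs_std = 0.10                   # assumed observation noise
    s, h = 0, 0.0
    switches, hidden, visible = [], [], []
    for _ in range(n_steps):
        switches.append(s)
        hidden.append(h)
        # Observation model p(v(t) | h(t)): noisy reading of the hidden state.
        visible.append(h + rng.gauss(0.0, obs_std))
        # Transition model p(h(t+1) | h(t), s(t)): the dynamics depend on the switch.
        h = coeff[s] * h + rng.gauss(0.0, noise_std[s])
        # The switch itself follows a two-state Markov chain.
        if rng.random() > stay_prob:
            s = 1 - s
    return switches, hidden, visible


switches, hidden, visible = simulate_switching_lds(n_steps=200)
print("time steps with the instrument on:", sum(switches))
```

The inference problem runs in the opposite direction: given only `visible`, recover `switches`. Sampling is cheap, whereas exact inference is not.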

The Challenge
A fundamental difficulty with Hybrid Networks is their intractability — unfortunately, the marriage of two tractable models, the Kalman Filter and Hidden Markov Model, does not result in a tractable Hybrid Network. Recently, developing approximate inference and learning methods in HNs has been a major research activity, across a large range of fields, including speech, music transcription, condition monitoring, control, robotics, and computer-brain interfaces.
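
One way to see the intractability: in a switching Kalman Filter, exact filtering has to account for every possible history of the switch variable, so with S switch states the exact posterior over h(t) after t time steps is a mixture of S^t Gaussians. The short sketch below simply tabulates this growth; the function name is hypothetical.

```python
def exact_posterior_components(n_switch_states, n_time_steps):
    """Number of Gaussian components in the exact filtered posterior over h(t)."""
    return n_switch_states ** n_time_steps


for t in (1, 10, 50, 100):
    count = exact_posterior_components(2, t)
    print(f"t = {t:3d}: {count} components")
```

Approximate schemes typically keep this number bounded, for example by merging or pruning components at each step.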
Whilst this challenge is largely theoretical in nature, in order to have at least one concrete problem, we will make available a dataset of acoustic recordings, for example of a live piano recording. The challenge would be to infer what notes were played and when, that is, to perform a transcription of the performance. We will know the ground truth (since we generated the data), against which we can compare competing solutions. Music transcription is a difficult, largely unsolved problem, although initial attempts using HNs have demonstrated the effectiveness of the HN solution. However, making faster approximation schemes in this area would overcome the current barrier to commercialisation of such techniques.
Although we have framed a music example here for concreteness, this is merely one instance of many application areas, and we do not expect any bias towards one particular application area.


The goal of the proposed challenge is to assess the current situation concerning Machine Learning (ML) algorithms for Information Extraction (IE) from documents, to identify future challenges and to foster additional research in the field. The aim is to:

  1. Define a methodology for fair comparison of ML algorithms for IE.
  2. Define a publicly available resource for evaluation that will exist and be used beyond the lifetime of the challenge; such a framework will be ML oriented, not IE oriented as so far proposed in other similar evaluations.
  3. Perform actual tests of different algorithms in controlled situations so as to understand what works and what does not, and therefore identify new future challenges.

Results of the challenge will be discussed in a workshop at the end of the evaluation. Moreover, such results will constitute material for discussion at the Dagstuhl workshop on Learning for the Semantic Web that some of the proposers are organizing for February 2005. The goal of the workshop is to discuss strategies and challenges for Machine Learning for the Semantic Web. The drafting of a white paper for future research is among the objectives of the workshop as well. Document annotation via IE is one of the topics. We are currently seeking sponsorship by Pascal for the event. We believe that this workshop will be a good place to show results of a Pascal activity.

The proposed challenge will be partly sponsored by the European Project IST Dot.Kom. Dot.Kom will fund the definition of the task, annotation of the corpora and implementation of the evaluation server. We seek an expense refund from Pascal for the parts that cannot be legally claimed from Dot.Kom because they cover activities specific to Pascal (mainly 1 person month for running the evaluation and preparation of the workshop).

IE, ML and Evaluation

Evaluation has a long history in Information Extraction (IE), mainly thanks to the MUC conferences, where most of the IE evaluation methodology (as well as most of the IE methodology as a whole) was developed (Hirschman 1998). In particular, the DARPA/MUC evaluations produced and made available annotated corpora that have been used as standard testbeds. More recently, a variety of other corpora have been shared by the research community, such as Califf’s job postings collection (Califf 1998) and Freitag’s seminar announcements, corporate acquisition and university Web page collections (Freitag 1998). Such corpora are available in the RISE repository, which contains a number of disparate corpora without any specific common aim, mainly defined by independent researchers for evaluating their own systems and then made available to others. Most of them are devoted to implicit relation extraction, i.e. the task mainly defined by the wrapper induction community, requiring the identification of implicit events and relations. For example, (Freitag 1998) defines the task of extracting speaker, start-time, end-time and location from a set of seminar announcements. No explicit mention of the event (the seminar) is made in the annotation. Implicit event extraction is simpler than full event extraction, but has important applications whenever either there is just one event per text or it is easy to devise extraction strategies for recognizing the event structure from the document (Ciravegna and Lavelli 2004).

However, the definition of an evaluation methodology and the availability of standard annotated corpora do not guarantee that the experiments performed with different approaches and algorithms proposed in the literature can be reliably compared. Some of the problems are common to other NLP tasks (e.g., see (Daelemans et al., 2003)): the difficulty of exactly identifying the effects on performance of the data used (the sample selection and the sample size), of the information sources used (the features selected), and of the algorithm parameter settings.

One issue specific to IE evaluation is how leniently to assess inexact identification of filler boundaries. (Freitag 1998) proposes three different criteria for matching reference instances and extracted instances: exact, overlap, contains. Another question concerns the possibility of multiple fillers for a slot and how the counting is performed. The problem is that this issue is often left implicit in papers. Finally, because of the complexity of the task, the limited availability of tools, and the difficulty of re-implementing published algorithms (usually quite complex and sometimes not fully described in papers), in IE there are very few comparative articles in the sense mentioned in (Daelemans et al., 2003). Most of the papers simply present the results of the new proposed approach and compare them with the results reported in previous articles. There is rarely any detailed analysis to ensure that the same methodology is used across different experiments.
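
To show why the choice of matching criterion matters, the sketch below implements one plausible reading of the three criterion names mentioned above. The definitions are assumptions based on the names alone, not quotations from (Freitag 1998): spans are (start, end) token offsets with an exclusive end, "overlap" means the spans share at least one token, and "contains" means the extracted span fully covers the reference span.

```python
def exact_match(reference, extracted):
    """Both boundaries coincide."""
    return reference == extracted


def overlap_match(reference, extracted):
    """The two spans share at least one token (assumed reading of 'overlap')."""
    return reference[0] < extracted[1] and extracted[0] < reference[1]


def contains_match(reference, extracted):
    """The extracted span covers the whole reference span (assumed reading of 'contains')."""
    return extracted[0] <= reference[0] and reference[1] <= extracted[1]


# Hypothetical reference filler occupying tokens 10, 11 and 12.
reference = (10, 13)
candidates = [(10, 13), (9, 14), (12, 16), (20, 22)]
results = {
    span: (exact_match(reference, span),
           overlap_match(reference, span),
           contains_match(reference, span))
    for span in candidates
}
for span, (exact, overlap, contains) in results.items():
    print(f"{span}: exact={exact} overlap={overlap} contains={contains}")
```

The same system output can therefore score as a hit or a miss depending on the criterion, which is why a comparison between papers is only meaningful when the criterion is stated.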

In this challenge, we propose a framework for evaluating algorithms for implicit relation extraction where the details of the evaluation are set explicitly. The methodology will be applied to test different algorithms in a specific task (a public call will be issued to invite international research groups to participate in the evaluation), but it will be applicable to future evaluation tasks as well.

Expected Outcome

The outcome of the challenge will be:

  • A comparative evaluation of a number of state of the art algorithms for ML-based IE;
  • An evaluation strategy for future experiments also after the end of the challenge and for other tasks;
  • An implemented evaluation framework for the field including a scorer.

Difficulty that challengers will address:

Focus of challenge

The task will simulate human-centred document annotation for Knowledge Management or the Semantic Web as found in tools like MnM (Vargas Vera et al. 2002), Melita (Ciravegna et al. 2002) and Ontomat (Handschuh et al. 2002). In these tools, the role of IE is to learn on-line from the user’s annotations and to present every new document with suggestions derived by generalizing from the previous annotations and/or analysing the unannotated part. In many applications, when the IE system has learnt enough, the user exits the annotation loop and automatic annotation is provided for the new documents.

The crucial points for adaptive IE in this kind of environment relate to the noisy and limited training material. The annotated material is noisy because humans are imperfect; annotation tends to be a tiring and therefore error-prone process. The annotated material is generally of limited size because in most cases users cannot, or are not willing to, provide more than a thousand annotations at most. In addition, users generally require that the IE system starts learning very quickly and that suggestions are frequent and reliable (effectiveness and low intrusivity (Ciravegna et al. 2002)). It is therefore important to study adaptive algorithms in order to understand their behaviour when learning from a limited amount of training material before using them in such an annotation environment. The ability to learn quickly and reliably is one of the aspects that we will evaluate, together with the ability to perform automatic annotation after a reasonable amount of training.

The requirements of the human centred annotation mentioned above are common to other application areas and therefore the evaluation will be representative of the applicability of ML+IE for a number of other tasks.

The focus of the challenge will be on documents that in our experience tend to be quite common in real-world applications, but that have so far been neglected by other evaluations of IE systems: semi-structured Web documents (web pages, emails, etc.) where information is conveyed through two types of channels: language and format. These documents are semi-structured in the sense that they tend to contain free text, but the use of formatting can make sentences choppy and semi-structured. An example is the main page of where there is a central panel containing free text, and a number of side panels containing lists (up to a couple of words) and titles (choppy sentences up to 10 words). In addition, much regularity can be found, especially in formatting, but such regularities are not as rigid as in pages produced by databases (as in ). For example, a personal bibliographic page such as is highly structured internally, but its structure is different from any other personal bibliographic page. Examples of semi-structured documents are seminar announcements, job postings, conference announcement web pages, etc.

The characteristics above make this task completely different from other analogous tasks for a number of reasons:

  • Use of semi-structured texts: previous evaluations such as CoNLL and MUC have focused on newspaper documents. Semi-structured documents are a class of documents with completely different characteristics, and one that is very important from the application point of view; as mentioned, there is currently no established methodology for this kind of document (Lavelli et al. 2004);
  • Evaluation of implicit relations: this is a task that has never been evaluated in any competitive evaluation, although it is an area with great application potential: some tools currently use this approach for commercial applications (Ciravegna and Lavelli 2004), so an assessment of the IE capabilities of ML algorithms is quite timely;
  • Machine learning oriented study:
    • We will study the way different algorithms behave in different situations, with different availability of training material, in order to simulate a wide variety of application situations. In addition to the study of the classic task of extracting information given separate training and test sets, we will also focus on more ML-oriented evaluations such as tracing the learning curve when the training material is increased progressively. We will also investigate the behaviour of algorithms able to exploit unannotated material. We believe that this kind of evaluation is much more interesting for the ML community than the standard IE scenario;
    • Most of the evaluations carried out so far were taken from an IE perspective. For example all the MUC tasks but Named Entity Recognition (NER) required the composition of a number of tasks such as NER, coreference resolution, template filling, etc. This composition risks obscuring the contribution of the ML algorithms to the IE task.

In our evaluation, we intend to evaluate ML algorithms on implicit relation recognition, a task that is more complex than generic NER but that allows the contribution of the ML algorithms to be stated clearly.

Describe the expected impact that the challenge could have on research in PASCAL fields?

The goal of the proposed challenge is to assess the current state of Machine Learning (ML) algorithms for Information Extraction (IE) from documents, to identify future challenges and to foster additional research in the field. The aim is to:

  1. Define a methodology for fair comparison of ML algorithms for IE.
  2. Define a publicly available resource for evaluation that will exist and be used beyond the lifetime of the challenge; this framework will be ML oriented, not IE oriented as in the other similar evaluations proposed so far.
  3. Perform actual tests of different algorithms in controlled situations so as to understand what works and what does not, and thereby identify new future challenges.

The results of the challenge will be discussed in a workshop at the end of the evaluation. Moreover, these results will provide material for discussion at the Dagstuhl workshop on Learning for the Semantic Web that some of the proposers are organizing for February 2005. The goal of the workshop is to discuss strategies and challenges for Machine Learning for the Semantic Web. Drafting a white paper for future research is also among the objectives of the workshop. Document annotation via IE is one of the topics. We are currently seeking sponsorship from PASCAL for the event. We believe that this workshop will be a good place to show the results of a PASCAL activity.

Number of teams which would be interested in participating in the challenge

In recent years, more than 30 international groups have presented results of ML algorithms tested on some of the currently available corpora for implicit relation definition, such as the CMU Seminar Announcements (Freitag 1998), the Job Postings (Califf 1998), the Academic Web Pages (Freitag 1998) and all the corpora for wrapper induction proposed so far (e.g. the Zagat corpus).

Groups include the University of Sheffield (UK), University College Dublin (IRL), ITC-Irst (I), Roma Tor Vergata (I), the National Centre for Scientific Research Demokritos (G), University of Antwerp (B), Carnegie Mellon University (USA), University of Washington (USA), University of Utah (USA), Cornell University (USA), University of Illinois UC (USA), the University of Texas at Austin (USA), USC-Information Science Institute (USA) and Stanford University (USA). Some of them have already been contacted and have expressed their willingness to participate in this challenge. We believe that this challenge will have a high international profile and will be an excellent vehicle for reaching a large community.

How long will it take to collect and preprocess any associated datasets?

The corpus for the task will be collected from the Web by querying Google (<1 day). Three annotators will manually annotate the 600 documents (<1 week). Preprocessing via GATE will require less than a day. Preparing the evaluation server will require about two weeks of person time. Overall, we expect the preparation of the task to take about one month of elapsed time and about 8 person-weeks of effort.

On what date will these associated datasets be available?

End of June 2004

Timescale for organizing the challenge:

  • June 2004: availability of the formal definition of the task and of the annotated corpus, together with the evaluation server;
  • October 2004: formal evaluation;
  • November 2004: workshop;
  • February 2005: Dagstuhl workshop on Learning for the Semantic Web.

How will the results of the challenge be evaluated?

We will collect a corpus of 1,100 conference workshop calls for papers (CFPs) from the Web; 600 will be annotated and 500 will be left unannotated. Workshops from a variety of fields will be sampled, e.g. Computer Science, Biomedical, Psychology. However, due to their prevalence on the Web, the majority of the documents are likely to be Computer Science based. The exact task will be defined during the preparation phase, but we expect to require extraction of:

  • Name of Workshop
  • Acronym of Workshop
  • Date of Workshop
  • Location of Workshop
  • Name of Conference
  • Date of Conference
  • Homepage of Conference
  • Location of Conference
  • Registration Date of Workshop
  • Submission Date of Workshop
  • Notification Date of Workshop
  • Camera Ready Copy Date of Workshop
  • Programme Chair/Co-chairs of Workshop (name plus affiliation)

In the preparation phase, we will define the exact experimental setup (both the numerical proportions between the training and test sets and the procedure adopted to select the documents). The experimental setup mentioned in the following is representative of the direction of work, but further discussion is still needed. We will also specify all of the following: (1) a set of fields to extract, (2) the legal numbers of fillers for each field, (3) the possibility of multiple varying occurrences of any particular filler and (4) how stringently matches are evaluated (exact, overlap or contains).

We will define and implement an evaluation server for the preliminary testing and for testing the final results. This server will be based on the MUC scorer (Douthat 1998). We will define the exact matching strategies by providing the configuration file for each of the tasks selected. Finally, we will set up a public location where people will be able to store new corpora and expected results, together with the guidelines to be strictly followed for the evaluation. This will guarantee a reliable comparison of the performance of different algorithms even after the PASCAL competition is over. Moreover, it will allow further fair evaluation settings to be defined.

Corpora will be annotated using Melita (Ciravegna et al. 2002), an existing tool that is already in use in scientific and commercial evaluation. Inter-annotator agreement will be ensured by a procedure in which three annotators are given overlapping subsets of the 600 documents to annotate. Discrepancies in annotations (computed automatically by a program) will be discussed among annotators. Annotation will be performed in stages (e.g. 30, 100, 300, 600 documents) with discussion of strategies and discrepancies after every stage.
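The automatic discrepancy check could look roughly like the following sketch. It assumes each annotator's output is available as a mapping from document identifier to a set of (field, start, end) annotations; this representation and the function name are illustrative assumptions, not Melita's actual export format.

```python
from typing import Dict, Set, Tuple

# Assumption: one annotation is (field name, start offset, end offset).
Annotation = Tuple[str, int, int]
Annotations = Dict[str, Set[Annotation]]  # document id -> annotations


def discrepancies(a: Annotations, b: Annotations) -> Dict[str, Set[Annotation]]:
    """Return, per shared document, the annotations made by only one annotator."""
    result: Dict[str, Set[Annotation]] = {}
    for doc_id in a.keys() & b.keys():  # only documents both annotators saw
        diff = a[doc_id] ^ b[doc_id]    # symmetric difference
        if diff:
            result[doc_id] = diff
    return result
```

Documents with an empty difference are omitted from the result, so the returned mapping is the list of cases the annotators would need to discuss.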

Before the beginning of the evaluation, the corpora will be preprocessed using an existing NLP system: documents will be tokenized and annotated with POS tags, gazetteer information and named entities. The different algorithms will have to use this preprocessed data. This is to ensure that they all have access to the same information: in this way we believe that we will be able to measure each algorithm’s ability on a fair and equal basis, as already done in other evaluations such as CoNLL. Moreover, this will allow researchers to concentrate on the task of learning without having to spend time on the linguistic pre-processor. We also believe that in this way we will enable researchers with limited or no knowledge of language analysis to participate in the task: they will not risk being penalised for their inability to define a good linguistic pre-processor. The pre-processing results will be provided as produced by the system; no human correction will be performed. This retains the noise found in real application environments.


The corpus will consist of 600 annotated and 500 unannotated documents. In order to provide a realistic test scenario, the 200 most recent annotated CFPs will form the test set. The remaining 400 annotated documents will be divided into four 100-document partitions, enabling comparative cross-validation experiments to be performed.
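The split described above can be sketched as follows. This is only an illustration: the text does not say how the 400 remaining documents are assigned to partitions, so the seeded shuffle is an assumption, as are the function name and the (id, timestamp) document representation.

```python
import random
from typing import Any, List, Sequence, Tuple


def split_corpus(
    docs: Sequence[Tuple[str, Any]],  # (doc_id, sortable timestamp)
    n_test: int = 200,
    n_folds: int = 4,
    seed: int = 0,
) -> Tuple[List[str], List[List[str]]]:
    """Hold out the most recent documents; partition the rest into folds."""
    ordered = sorted(docs, key=lambda d: d[1], reverse=True)  # newest first
    test = [doc_id for doc_id, _ in ordered[:n_test]]
    rest = [doc_id for doc_id, _ in ordered[n_test:]]
    # Assumption: partition membership is random but reproducible.
    random.Random(seed).shuffle(rest)
    fold_size = len(rest) // n_folds  # any remainder is left unused
    folds = [rest[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    return test, folds
```

With 600 annotated documents this yields a 200-document test set and four disjoint 100-document partitions. Fixing the seed matters because all participants are expected to report cross-validation results on the same partitions.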

The three tasks described below will be evaluated. Each participant can decide to participate in any of the tasks, but participation in task 1 is mandatory. Participants will be asked to use the preprocessed form of the corpus, but an optional task (evaluated separately) will also allow algorithms to use a different pre-processor.

TASK1: Full scenario: Learning to annotate implicit information.

Given 400 annotated documents, learn to extract information. Each algorithm provides results of a four-fold cross-validation experiment using the same document partitions for pre-competitive tests. The main goal of this task is to evaluate the ability of the system to learn how to extract information given a closed world (200 most recent annotated documents). The task will measure the ability to generalize over a limited amount of training material in an environment with a large amount of sparse data.

TASK2: Active learning: Learning to select documents

In this task, the same corpus of 600 documents mentioned above will be used; 400 as training documents and 200 as test documents. Baseline: given fixed subsets of the training corpus of increasing size (e.g. 10, 20, 30, 50, 75, 100, 150, 200), show the learning ability on the full test corpus. Advanced: given an initial number of annotated documents as a seed (e.g. 10), select training subsets of increasing size (e.g. 20, 30, 50, 75, 100, 150, 200) in order to show the algorithm’s ability to select the most suitable set of training documents from an unannotated pool.

Each algorithm’s results will be plotted on a chart in order to study its learning curve and to allow a better understanding of the results obtained in TASK1. Moreover, the ability to quickly reach reliable results is an important feature of any adaptive IE system supporting annotation (Ciravegna et al. 2002), so the study of the learning curve will allow us to assess the suitability of the algorithm for online learning.
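A minimal sketch of how the baseline learning curve could be traced is given below. The `train_and_score` callable is a hypothetical stand-in for a participant's system (train on a subset, return a score on the test set), and using nested prefixes of the training list as the fixed subsets is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Subset sizes taken from the examples given in the task description.
SIZES = (10, 20, 30, 50, 75, 100, 150, 200)


def learning_curve(
    train_docs: Sequence[str],
    test_docs: Sequence[str],
    train_and_score: Callable[[Sequence[str], Sequence[str]], float],
    sizes: Sequence[int] = SIZES,
) -> List[Tuple[int, float]]:
    """Return (training size, score on the full test set) points."""
    curve: List[Tuple[int, float]] = []
    for n in sizes:
        if n > len(train_docs):
            break  # not enough training material for this size
        subset = train_docs[:n]  # assumption: subsets are nested prefixes
        curve.append((n, train_and_score(subset, test_docs)))
    return curve
```

The returned points can be plotted directly. For the advanced setting, the prefix selection would be replaced by the algorithm's own choice of which documents to add at each step.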

TASK3: Enriched Scenario

Same as the full scenario, but the algorithms will be able to use a richer set of information sources. In particular, we will focus on using the unannotated part of the corpus (500 documents). Goal: study how unsupervised or semi-supervised methods can improve the results of supervised approaches. An interesting variant of this task could concern the use of unlimited resources, e.g. the Web.


  • (Califf 1998) Califf, M. and R. Mooney, 2003. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177-210.
  • (Ciravegna et al. 2002) Fabio Ciravegna, Alexiei Dingli, Daniela Petrelli, and Yorick Wilks, 2002. User-system cooperation in document annotation based on information extraction. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, EKAW02. Springer Verlag.
  • (Ciravegna and Lavelli 2004) Fabio Ciravegna and Alberto Lavelli 2004: LearningPinocchio: Adaptive Information Extraction for Real World Applications Journal of Natural Language Engineering, 10 (2).
  • (Daelemans et al., 2003) Daelemans, Walter and Véronique Hoste, 2002. Evaluation of machine learning methods for natural language processing tasks. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002). Las Palmas, Spain.
  • (Douthat 1998) Douthat, A., 1998. The message understanding conference scoring software user’s manual. In Proceedings of the 7th Message Understanding Conference (MUC-7).
  • (Freitag 1998) Freitag, Dayne, 1998. Machine Learning for Information Extraction in Informal domains. Ph.D. thesis, Carnegie Mellon University
  • (Handschuh et al. 2002) S. Handschuh, S. Staab, and F. Ciravegna, 2002: S-CREAM- Semi-automatic CREAtion of Metadata. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, EKAW02. Springer Verlag, 2002.
  • (Hirschman 1998) Hirschman, Lynette, 1998. The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech and Language, 12:281-305.
  • (Lavelli et al. 2004) A. Lavelli, M. E. Califf, F. Ciravegna, D. Freitag, C. Giuliano, N. Kushmerick, L. Romano, 2003: A Critical Survey of the Methodology for IE Evaluation, Proceedings of the 3rd LREC Conference, Crete, May 2004.
  • (Vargas Vera et al. 2002) M. Vargas-Vera, Enrico Motta, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna, 2002. MnM: Ontology driven semi-automatic or automatic support for semantic markup. In Proc. of the 13th International Conference on Knowledge Engineering and Knowledge Management, EKAW02. Springer Verlag, 2002.

Recognising Textual Entailment Challenges

Textual Entailment Recognition was proposed recently as a generic task that captures major semantic inference needs across many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and (multi-)document summarization. The task is to recognise, given two text fragments, whether the meaning of one text is entailed by (can be inferred from) the other.

The First Recognising Textual Entailment Challenge

(RTE 1)

The first PASCAL Recognising Textual Entailment Challenge (15 June 2004 – 10 April 2005) provided the first benchmark for the entailment task. The challenge drew noticeable attention in the research community, attracting 17 submissions from research groups worldwide. The relatively low accuracy achieved by the participating systems suggests that the entailment task is indeed a challenging one, with ample room for improvement.

Challenge citation: Please use the following citation when referring to the RTE challenge:
Ido Dagan, Oren Glickman and Bernardo Magnini. The PASCAL Recognising Textual Entailment Challenge. In Quiñonero-Candela, J.; Dagan, I.; Magnini, B.; d’Alché-Buc, F. (Eds.), Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, pp. 177-190, Springer, 2006.


Recent years have seen a surge in research of text processing applications that perform semantic-oriented inference about concrete text meanings and their relationships. Even though many applications face similar underlying semantic problems, these problems are usually addressed in an application oriented manner. Consequently it is difficult to compare, under a generic evaluation framework, semantic methods that were developed within different applications. The PASCAL Challenge introduces textual entailment as a common task and evaluation framework for Natural Language Processing, Information Retrieval and Machine Learning researchers, covering a broad range of semantic-oriented inferences needed for practical applications. This task is therefore suitable for evaluating and comparing semantic-oriented models in a generic manner. Eventually, work on textual entailment may promote the development of generic semantic “engines”, which will play an analogous role to that of generic syntactic analyzers across multiple applications.

Textual Entailment

Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text is entailed (can be inferred) from another text. This task captures generically a broad range of inferences that are relevant for multiple applications. For example, a Question Answering (QA) system has to identify texts that entail the expected answer. Given the question “Who killed Kennedy?”, the text “the assassination of Kennedy by Oswald” entails the expected answer form “Oswald killed Kennedy”. Similarly, in Information Retrieval (IR) the concept denoted by a query expression should be entailed from relevant retrieved documents. In multi-document summarization a redundant sentence or expression, to be omitted from the summary, should be entailed from other expressions in the summary. In Information Extraction (IE) entailment holds between different text variants that express the same target relation. And in Machine Translation evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations have to entail each other. Thus, in a similar spirit to Word Sense Disambiguation and Named Entity Recognition which are recognized as generic tasks, modeling textual entailment may consolidate and promote broad research on applied semantic inference.

Task Definition

Participants in the evaluation exercise will be provided with pairs of small text snippets (one or more sentences in English), which we term Text-Hypothesis (T-H) pairs. The data set will include over 1000 English T-H pairs from the news domain (political, economic, etc.). Examples will be manually tagged for entailment (i.e. whether T entails H or not) by human annotators and will be divided into a Development Set (one third of the data) and a Test Set (two thirds of the data). Participating systems will have to decide for each T-H pair whether T indeed entails H or not, and results will be compared to the manual gold standard.
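Comparing system decisions with the gold standard amounts to computing accuracy over the T-H pairs. The sketch below assumes decisions and gold labels are keyed by a pair identifier and that a pair with no system decision counts as an error; both are illustrative assumptions rather than the official scoring rules.

```python
from typing import Dict


def accuracy(predictions: Dict[str, bool], gold: Dict[str, bool]) -> float:
    """Fraction of gold T-H pairs whose entailment label was predicted correctly."""
    if not gold:
        raise ValueError("gold standard is empty")
    # Assumption: a pair missing from the predictions is scored as wrong.
    correct = sum(
        1
        for pair_id, label in gold.items()
        if pair_id in predictions and predictions[pair_id] == label
    )
    return correct / len(gold)
```

For example, a system that labels two of four test pairs correctly, including one pair it did not answer, would score 0.5.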

The dataset will be collected with respect to different text processing applications, such as question answering, information extraction, information retrieval, multi-document summarization, paraphrase acquisition, and machine translation. Each portion of the dataset will include typical T-H examples that correspond to success and failure cases of actual applications. The examples will represent different levels of entailment reasoning, such as lexical, syntactic, morphological and logical.

The goal of the challenge is to provide a first opportunity for presenting and comparing possible approaches for modeling textual entailment. In this spirit, we aim at an explorative rather than a competitive setting. While participant results will be reported there will not be an official ranking of systems. A development set will be released first to give an early impression of the different types of test examples. The test set will be released two months prior to the result submission date, but, of course, reported systems are expected to be generic in nature. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs that are present in the test data, as long as the methodology is general and the cost of running the learning/acquisition procedure at full scale can be reasonably estimated.

Example T-H pairs:

Text: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
Hypothesis: Yahoo bought Overture.

Text: Microsoft’s rival Sun Microsystems Inc. bought Star Office last month and plans to boost its development as a Web-based device running over the Net on personal computers and Internet appliances.
Hypothesis: Microsoft bought Star Office.

Text: The National Institute for Psychobiology in Israel was established in May 1971 as the Israel Center for Psychobiology by Prof. Joel.
Hypothesis: Israel was established in May 1971.

Text: Since its formation in 1948, Israel fought many wars with neighboring Arab countries.
Hypothesis: Israel was established in 1948.

Text: Putting hoods over prisoners’ heads was also now banned, he said.
Hypothesis: Hoods will no longer be used to blindfold Iraqi prisoners.

Text: The market value of u.s. overseas assets exceeds their book value.
Hypothesis: The market value of u.s. overseas assets equals their book value.


For registration, further information and inquiries, contact Oren Glickman.

Organizing Committee

Ido Dagan (Bar Ilan University, Israel)

Oren Glickman (Bar Ilan University, Israel)

Bernardo Magnini (ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, Italy)