Motivation

A fundamental phenomenon of natural language is the variability of semantic expression, where the same meaning can be expressed by, or inferred from, different texts. Many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and (multi-)document summarization need to model this variability in order to recognize that a particular target meaning can be inferred from different text variants. Even though many applications face similar underlying semantic problems, these problems are usually addressed in an application-oriented manner. Consequently, it is difficult to compare, under a generic evaluation framework, semantic methods that were developed within different applications. The PASCAL RTE Challenge introduces textual entailment as a common task and evaluation framework, covering a broad range of semantic-oriented inferences needed for practical applications. This task is therefore suitable for evaluating and comparing semantic-oriented models in a generic manner. Eventually, work on textual entailment may promote the development of general semantic “engines”, which will be used across multiple applications.

Textual Entailment

Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text is entailed by (can be inferred from) the other (see the Instructions tab for the specific operational definition of textual entailment assumed in the challenge). This task generically captures a broad range of inferences that are relevant for multiple applications. For example, a Question Answering (QA) system has to identify texts that entail the expected answer. Given the question “Who killed Kennedy?”, the text “the assassination of Kennedy by Oswald” entails the expected answer “Oswald killed Kennedy”. Similarly, in Information Retrieval (IR) the concept denoted by a query expression should be entailed by relevant retrieved documents. In multi-document summarization a redundant sentence or expression, to be omitted from the summary, should be entailed by other expressions in the summary. In Information Extraction (IE) entailment holds between different text variants that express the same target relation. And in Machine Translation evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations have to entail each other. Thus, modeling textual entailment may consolidate and promote broad research on applied semantic inference.

Task Definition

Participants in the evaluation exercise are provided with pairs of small text snippets (one or more sentences in English), which we term Text-Hypothesis (T-H) pairs. Examples were manually tagged for entailment (i.e. whether T entails H or not) by human annotators and will be divided into a Development Set, containing 800 pairs, and a Test Set, containing 800 pairs. Participating systems will have to decide for each T-H pair whether T indeed entails H or not, and results will be compared to the manual gold standard.

The goal of the RTE challenges is to provide opportunities for presenting and comparing possible approaches for modeling textual entailment. In this spirit, we aim at an explorative rather than a competitive setting. While participant results will be reported, there will not be an official ranking of systems. A development set is released first to provide typical examples of the kinds of pairs that appear in the test set. The test set will be released three weeks prior to the result submission date. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection from corpora or the Web) specifically for the linguistic constructs that are present in the test data, as long as the methodology is general and fully automated, and the cost of running the learning/acquisition procedure at full scale can be reasonably estimated.

Dataset Collection and Application Settings

The dataset of Text-Hypothesis pairs was collected by human annotators. It consists of four subsets, which correspond to typical success and failure settings in different applications (as listed below). Within each application setting the annotators selected both positive entailment examples (annotated YES), where T does entail H, as well as negative examples (annotated NO), where entailment does not hold (50%-50% split). Some T-H examples appear in the Instructions section. H is a (usually short) single sentence, and T consists of one or two sentences.

One of the main goals for the RTE-2 dataset is to provide more realistic text-hypothesis examples, originating from actual applications. Therefore, the examples are based mostly on outputs of existing web-based systems (see Acknowledgments below). We allowed only minimal correction of texts extracted from the web, e.g. fixing spelling and punctuation but not style; therefore, the English of some of the pairs is less than perfect.

Information Retrieval (IR):

In this application setting, the hypotheses are propositional IR queries, which specify some statement, e.g. “Alzheimer’s disease is treated using drugs”. The hypotheses were adapted and simplified from standard IR evaluation datasets (TREC and CLEF). Texts (T) that do or do not entail the hypothesis were selected from documents retrieved by different search engines (e.g. Google, Yahoo and Microsoft) for each hypothesis. In this application setting it is assumed that relevant documents (from an IR perspective) must necessarily entail the hypothesis.

Multi-document summarization (SUM):

In this setting T and H are sentences taken from a news document cluster, a collection of news articles that describe the same news item. Annotators were given output of multi-document summarization systems, including the document clusters and the summary generated for each cluster. The annotators picked sentence pairs with high lexical overlap, preferably where at least one of the sentences was taken from the summary (this sentence usually played the role of T). For positive examples, the hypothesis was simplified by removing sentence parts, until it was fully entailed by T. Negative examples were simplified in the same manner. This process simulates the need of a summarization system to identify information redundancy, which should be avoided in the summary.

Question Answering (QA):

Annotators were given questions and the corresponding answers returned by QA systems. Transforming a question-answer pair into a text-hypothesis pair consisted of the following stages: First, the annotators picked from the answer passage an answer term of the expected answer type, either a correct or an incorrect one. Then, the annotators turned the question into an affirmative sentence with the answer term “plugged in”. These affirmative sentences serve as the hypotheses (H), and the original answer passage serves as the text (T). For example, given the question “Who is Ariel Sharon?” and an answer text “Israel’s prime Minister, Ariel Sharon, visited Prague” (T), the question is turned into the statement “Ariel Sharon is the Israeli Prime Minister” (H), producing a positive entailment pair. This process simulates the need of a QA system to verify that the retrieved passage text indeed entails the provided answer.

Information Extraction (IE):

This task is inspired by the Information Extraction (and Relation Extraction) application, adapting the setting to pairs of texts rather than a text and a structured template. The pairs were generated using four different approaches. In the first approach, ACE-2004 relations (the relations tested in the ACE-2004 RDR task) were taken as templates for hypotheses. Relevant news articles were then collected as texts (T) and corresponding hypotheses were generated manually based on the ACE templates and slot fillers taken from the text. For example, given the ACE relation ‘X work for Y’ and the text “An Afghan interpreter, employed by the United States, was also wounded.” (T), a hypothesis “An interpreter worked for Afghanistan.” is created, producing a non-entailing (negative) pair. In the second approach, the MUC-4 annotated dataset was similarly used to create entailing pairs. In the third approach, the outputs of actual IE systems, for both the MUC-4 dataset and the news articles collected for the ACE relations, were used to generate entailing and non-entailing pairs. In the fourth approach, new types of hypotheses that may correspond to typical IE relations were manually generated for different sentences in the collected news articles. These processes simulate the need of IE systems to recognize that the given text indeed entails the semantic relation that is expected to hold between the candidate template slot fillers.

Challenge Organizers

Bar-Ilan University, Israel (Coordinator): Ido Dagan, Roy Bar-Haim, Idan Szpektor
Microsoft Research, USA: Bill Dolan
MITRE, USA: Lisa Ferro
CELCT, Trento – Italy: Bernardo Magnini, Danilo Giampiccolo


Acknowledgments

The following sources were used in the preparation of the data:

  • AnswerBus question answering system, provided by Zhiping Zheng, Computational Linguistics Department, Saarland University.
  • PowerAnswer question answering system, from Language Computer Corporation, provided by Dan Moldovan, Abraham Fowler, Christine Clark, Arthur Dexter and Justin Larrabee.
  • Columbia NewsBlaster multi-document summarization system, from the Natural Language Processing group at Columbia University’s Department of Computer Science.
  • NewsInEssence multi-document summarization system, provided by Dragomir R. Radev and Jahna Otterbacher from the Computational Linguistics And Information Retrieval research group, University of Michigan.
  • IBM’s information extraction system, provided by Salim Roukos and Nanda Kambhatla, I.B.M. T.J. Watson Research Center.
  • New York University’s information extraction system, provided by Ralph Grishman, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University.
  • ITC-irst’s information extraction system, provided by Lorenza Romano, Cognitive and Communication Technologies (TCC) division, ITC-irst, Trento, Italy.
  • MUC-4 information extraction dataset, from the National Institute of Standards and Technology (NIST).
  • TREC-QA question collections, from the National Institute of Standards and Technology (NIST).
  • CLEF-QA question collections, from DELOS Network of Excellence for Digital Libraries.

We would like to thank the people and organizations that made these sources available for the challenge. In addition, we thank Oren Glickman and Dan Roth for their assistance and advice.

We would also like to acknowledge the people and organizations involved in creating and annotating the data: Malky Rabinowitz, Dana Mills, Ruthie Mandel, Errol Hayman, Vanessa Sandrini, Allesandro Valin, Elizabeth Lima, Jeff Stevenson, Amy Muia and the Butler Hill Group.

Datasets

development set
Note: if you downloaded the RTE-2 development set between 26/3/2007 (after the RTE-3 results submission deadline) and 27/4/2008, please download it again from the above link, as during that period the link temporarily pointed to an incorrect version of the development set (the test set was not affected).

test set
annotated test set
Preprocessed versions of the development and test set, including sentence splitting and dependency parsing, can be found here.

Errata: a slightly modified version of the development set, which fixes minor typographical errors in two pairs (#378 and #480), is available from the Stanford NLP group:

development set – Stanford

RTE-1 datasets

Task Definition

We consider an applied notion of textual entailment, defined as a directional relation between two text fragments, termed T the entailing text, and H the entailed text. We say that T entails H if, typically, a human reading T would infer that H is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge. Textual entailment recognition is the task of deciding, given T and H, whether T entails H.

ID | TEXT | HYPOTHESIS | TASK | ENTAILMENT
1 | The drugs that slow down or halt Alzheimer’s disease work best the earlier you administer them. | Alzheimer’s disease is treated using drugs. | IR | YES
2 | Drew Walker, NHS Tayside’s public health director, said: “It is important to stress that this is not a confirmed case of rabies.” | A case of rabies was confirmed. | IR | NO
3 | Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England’s Liverpool Airport as Liverpool John Lennon Airport. | Yoko Ono is John Lennon’s widow. | QA | YES
4 | Arabic, for example, is used densely across North Africa and from the Eastern Mediterranean to the Philippines, as the key language of the Arab world and the primary vehicle of Islam. | Arabic is the primary language of the Philippines. | QA | NO
5 | About two weeks before the trial started, I was in Shapiro’s office in Century City. | Shapiro works in Century City. | QA | YES
6 | Meanwhile, in his first interview to a Western print publication since his election as president of Iran earlier this year, Ahmadinejad attacked the “threat” to bring the issue of Iran’s nuclear activity to the UN Security Council by the US, France, Britain and Germany. | Ahmadinejad is a citizen of Iran. | IE | YES

Table 1: Example T-H pairs

Some additional judgment criteria and guidelines are listed below (examples are taken from Table 1):

  • Entailment is a directional relation. The hypothesis must be entailed by the given text, but the text need not be entailed by the hypothesis.
  • The hypothesis must be fully entailed by the text. Judgment would be NO if the hypothesis includes parts that cannot be inferred from the text.
  • Cases in which inference is very probable (but not completely certain) are judged as YES. In example #5 one could claim that although Shapiro’s office is in Century City, he never actually arrives at his office and works elsewhere. However, this interpretation of T is very unlikely, and so the entailment holds with high probability. On the other hand, annotators were guided to avoid vague examples for which the probability of inference is positive but not clearly very high.
  • Our definition of entailment allows presupposition of common knowledge, such as: a company has a CEO, a CEO is an employee of the company, an employee is a person, etc. For instance, in example #6, the entailment depends on knowing that the president of a country is also a citizen of that country.

Data Sets and Format

Both Development and Test sets are formatted as XML files, as follows:

<pair id="id_num" entailment="YES|NO" task="task_acronym">

<t> the text… </t>

<h> the hypothesis… </h>

</pair>

Where:

  • each T-H pair appears within a single <pair> element.
  • the <pair> element has the following attributes:
      • id, a unique numerical identifier of the T-H pair.
      • task, the acronym of the application setting from which the pair has been generated (see introduction): 'IR', 'IE', 'QA' or 'SUM'.
      • entailment (in the development set only), the gold standard entailment annotation, either 'YES' or 'NO'.
  • the <t> (text) element has no attributes and may consist of one or more sentences.
  • the <h> (hypothesis) element has no attributes and usually contains one simple sentence.

The data is split into a development set and a test set, to be released separately. The goal of the development set is to guide the development and tuning of participating systems. Note that since the task is unsupervised in nature and the development set has only anecdotal coverage, the development set is not expected to serve as a main resource for supervised training. Rather, it is typically assumed that systems will use generic techniques and resources.
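
To make the data format concrete, here is a minimal loading sketch in Python, using only the standard library. It assumes the XML layout described above; the file name dev.xml and the function name load_rte_pairs are illustrative, not part of the official distribution.

import xml.etree.ElementTree as ET

def load_rte_pairs(path):
    """Load T-H pairs from an RTE-2 XML file in the format described above."""
    root = ET.parse(path).getroot()
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "task": pair.get("task"),
            "entailment": pair.get("entailment"),   # None in the unannotated test set
            "text": (pair.findtext("t") or "").strip(),
            "hypothesis": (pair.findtext("h") or "").strip(),
        })
    return pairs

# Example usage (the file name is a placeholder for the downloaded development set):
# for p in load_rte_pairs("dev.xml")[:3]:
#     print(p["id"], p["task"], p["entailment"], "-", p["hypothesis"])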

Data preprocessing

Following requests made in the first RTE challenge, this year we preprocessed the text and hypothesis of each pair. The preprocessing includes sentence splitting, using MXTERMINATOR (Reynar and Ratnaparkhi, 1997), and dependency parsing, using MINIPAR (Lin, 1998). See the “Datasets” tab for further details. Using the pre-processed data is optional, and it is of course allowed to use alternative tools for preprocessing. Note that, since the preprocessing was done automatically, it does contain some errors. We provide this data as-is and give no warranty on its quality.
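
For participants who run their own preprocessing, the sketch below shows sentence splitting with NLTK's tokenizer; this is only an illustrative alternative tool, not the MXTERMINATOR/MINIPAR pipeline used to produce the official pre-processed data.

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer models (one-time download)

def split_sentences(text):
    """Split a text fragment (T or H) into sentences."""
    return nltk.sent_tokenize(text)

# Example usage:
# split_sentences("About two weeks before the trial started, I was in his office. It is in Century City.")
# -> ['About two weeks before the trial started, I was in his office.', 'It is in Century City.']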

Submission

Systems should tag each T-H pair as either YES, predicting that entailment does hold for the pair, or as NO otherwise. As indicated originally, this year partial submissions are not allowed – the submission should cover the whole dataset.

Systems should be developed based on the development data set. Analyses of the test set (either manual or automatic) should not impact in any way the design and tuning of systems that publish their results on the RTE-2 test set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs that are present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. In any case, participants are asked to report any process that was performed specifically for the test set.

Evaluation Measures

The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.

As a second measure, an Average Precision measure will be computed. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence (in decreasing order, from the most certain entailment to the least certain). The more confident the system is that T entails H, the higher the pair should be ranked. A perfect ranking would place all the positive pairs (for which the entailment holds) before all the negative pairs. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system’s precision values at all points in the ranked list at which recall increases, that is, at all points in the ranked list for which the gold standard annotation is YES (Voorhees and Harman, 1999). More formally, it can be written as follows:

AvgP = 1/R * Σ(i=1..n) E(i) * (#-correct-up-to-pair-i / i)

where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs, ordered by their ranking.

Note the difference between this measure and the Confidence Weighted Score used in the first challenge.

This score will be computed for systems that will provide as output a confidence-ranked list of all test examples (in addition to the YES/NO output for each example).
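
As a concrete illustration, the following Python sketch computes both measures; it assumes the gold annotations and system judgments are aligned lists of "YES"/"NO" strings, and that the ranked list is ordered exactly as in the submission file. The function names are illustrative and not part of the official evaluation script.

def accuracy(gold, predicted):
    """Fraction of pairs whose YES/NO judgment matches the gold standard."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def average_precision(ranked_gold):
    """Average precision over gold labels ordered by decreasing system confidence.

    Implements 1/R * sum over i of E(i) * (#-correct-up-to-pair-i / i), where
    #-correct-up-to-pair-i counts the positive (gold YES) pairs among the top i.
    """
    R = sum(1 for g in ranked_gold if g == "YES")
    positives_so_far = 0
    total = 0.0
    for i, g in enumerate(ranked_gold, start=1):
        if g == "YES":
            positives_so_far += 1
            total += positives_so_far / i
    return total / R if R else 0.0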

Results Submission Format

Results will be submitted in a file with one line for each T-H pair in the test set, in the following format:

pair_id<blank space>judgment

where

  • pair_id is the unique identifier of each T-H pair, as it appears in the test set.
  • judgment is either YES or NO.

The first line in the file should look as follows:

ranked:<blank space>yes/no

The first line indicates whether the submission includes confidence ranking of the pairs (see evaluation measures above). Average precision will be computed only for systems that specify “ranked: yes” in the first line. If the submission includes confidence ranking, the pairs in the file should be ordered by decreasing entailment confidence: the first pair should be the one for which the entailment is most certain, and the last pair should be the one for which the entailment is least likely (i.e. the one for which the “NO” judgment is most certain). Thus, in a ranked list all the positively classified pairs are expected to appear before all those that were classified as negative.

For example, suppose that the pair identifiers in the test set are 1…6. A valid submission file that includes ranking may look as follows:

ranked: yes

4 YES

3 YES

6 YES

1 NO

5 NO

2 NO

Participating teams will be allowed to submit results of up to 2 systems. The corresponding result files should be named run1.txt, and run2.txt if a second run is submitted.

The results files should be zipped and submitted via the submit form.
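
A small helper for producing a run file in this format might look as follows; this is a sketch, the file name run1.txt follows the naming convention above, and everything else is illustrative.

def write_submission(path, judgments, ranked=True):
    """Write a run file: a 'ranked:' header, then one 'pair_id judgment' line per pair.

    judgments: list of (pair_id, "YES"/"NO") tuples; if ranked is True, the list
    is assumed to be ordered from the most to the least confident entailment.
    """
    with open(path, "w") as f:
        f.write("ranked: %s\n" % ("yes" if ranked else "no"))
        for pair_id, judgment in judgments:
            f.write("%s %s\n" % (pair_id, judgment))

# Reproduces the example submission shown above:
# write_submission("run1.txt",
#                  [("4", "YES"), ("3", "YES"), ("6", "YES"),
#                   ("1", "NO"), ("5", "NO"), ("2", "NO")])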

System Reports

Participants are requested to submit a full system report by February 21 (we have a tight schedule this year due to the EACL 2006 conference in Trento, scheduled right before the RTE-2 Workshop). Reports should be up to 6 double-column pages, using the ACL style files and guidelines. The reports should include a system description, quantitative results on the test set, and qualitative and quantitative analysis of the results and system behavior. Reviews with comments for the camera-ready version and decisions about presentation in the workshop will be sent back to the authors in early March.

Reports are to be sent by email to Bernardo Magnini (magnini@itc.it) with the following subject line: “RTE2 report submission”.

We aim to have an interactive and informative workshop setting, which will enable exploring the space of the entailment phenomenon and alternative methods to approach it. Therefore we encourage, apart from the straightforward system description, some analysis of results and general observations you might have about the entailment recognition problem. We strongly believe that the plain score alone is not sufficiently informative for understanding and interpreting a system’s results. In particular, we advocate including:

  • General observations and analysis of the entailment phenomena, data types, problems, etc.
  • Analysis of system performance – analysis and characterization of success and failure cases, identifying inherent difficulties and limitations of the system vs. its strengths.
  • Description of the types of features used by the system; getting a feeling (with examples) for the concrete features that actually made an impact.
  • Noting whether there is a difference in performance between the development and test sets, and identifying the reasons; was there overfitting?
  • If your system (as described in the report) is complex – identifying which parts of the system were eventually effective vs. parts that were not crucial or even introduced noise, illustrating the different behaviors through examples.
  • In case the development set was used for learning detailed information (like entailment rules) – discussing whether the method is scalable.
  • Providing illustrative examples for system behavior.
  • Providing results also for the development set.


Camera ready report:

The camera-ready version of the report, to be included in the workshop proceedings, should be submitted in PDF format (with no page numbers) by March 14.

References

  1. D. Lin. 1998. Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on Evaluation of Parsing Systems at LREC 1998. Granada, Spain.
  2. J. C. Reynar and A. Ratnaparkhi. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, March 31 - April 3. Washington, D.C.
  3. E. Voorhees and D. Harman. 1999. Overview of the Seventh Text Retrieval Conference. In Proceedings of the Seventh Text Retrieval Conference (TREC-7). NIST Special Publication.

Results

Download RTE-2 evaluation script: RTE2_evaluate.pl (requires perl)

Usage: perl RTE2_evaluate.pl annotated_xml_filename < run_filename

(RTE-2 annotated test set is available here)

RTE-2 submitted runs and results

Notice: Proper scientific methodology requires that testing be blind. Therefore, if you plan to further use the RTE-2 test set for evaluation, it is advisable not to perform any analysis of this data set, including the detailed information provided in these runs.

download all runs

# First Author (Group) Run Accuracy Average Precision
1 Adams (Dallas) run1 0.6262 0.6282
2 Bos (Rome & Leeds) run1 0.6162 0.6689
run2 0.6062 0.6042
3 Burchardt (Saarland) run1 0.5900
run2 0.5775
4 Clarke (Sussex) run1 0.5275 0.5254
run2 0.5475 0.5260
5 de Marneffe (Stanford) run1 0.5763 0.6131
run2 0.6050 0.5800
6 Delmonte (Venice) run1* 0.5563 0.5685
7 Ferrández (Alicante) run1 0.5563 0.6089
run2 0.5475 0.5743
8 Herrera (UNED) run1 0.5975 0.5663
run2 0.5887
9 Hickl (LCC) run1 0.7538 0.8082
10 Inkpen (Ottawa) run1 0.5800 0.5751
run2 0.5825 0.5816
11 Katrenko (Amsterdam) run1 0.5900
run2 0.5713
12 Kouylekov (ITC-irst & Trento) run1 0.5725 0.5249
run2 0.6050 0.5046
13 Kozareva (Alicante) run1 0.5487 0.5589
run2 0.5500 0.5485
14 Litkowski (CL Research) run1 0.5813
run2 0.5663
15 Marsi (Tilburg & Twente) run1 0.6050
16 Newman (Dublin) run1 0.5250 0.5052
run2 0.5437 0.5103
17 Nicholson (Melbourne) run1 0.5288 0.5464
run2 0.5088 0.5053
18 Nielsen (Colorado) run1* 0.6025 0.6396
run2* 0.6112 0.6379
19 Rus (Memphis) run1 0.5900 0.6047
run2 0.5837 0.5785
20 Schilder (Thomson & Minnesota) run1 0.5437
run2 0.5550
21 Tatu (LCC) run1 0.7375 0.7133
22 Vanderwende (Microsoft Research & Stanford) run1 0.6025 0.6181
run2 0.5850 0.6170
23 Zanzotto (Milan & Rome) run1 0.6388 0.6441
run2 0.6250 0.6317

* Resubmitted after publication of the official results. Resubmission was allowed only in case of a bug fix, so that the updated results are the correct output of the system described in the RTE-2 proceedings.