Motivation

A fundamental phenomenon of natural language is the variability of semantic expression, where the same meaning can be expressed by or inferred from different texts. Many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and (multi) document summarization need to model this variability in order to recognize that a particular target meaning can be inferred from different text variants. Even though many applications face similar underlying semantic problems, these problems are usually addressed in an application-oriented manner. Consequently it is difficult to compare, under a generic evaluation framework, semantic methods that were developed within different applications. The PASCAL RTE Challenge introduces textual entailment as a common task and evaluation framework, covering a broad range of semantic-oriented inferences needed for practical applications. This task is therefore suitable for evaluating and comparing semantic-oriented models in a generic manner. Eventually, work on textual entailment may promote the development of general semantic “engines”, which will be used across multiple applications.

Textual Entailment

Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text is entailed (can be inferred) from another text (see the Instructions tab for the specific operational definition of textual entailment assumed in the challenge). This task captures generically a broad range of inferences that are relevant for multiple applications. For example, a Question Answering (QA) system has to identify texts that entail the expected answer. Given the question “Who killed Kennedy?”, the text “the assassination of Kennedy by Oswald” entails the expected answer “Oswald killed Kennedy”. Similarly, in Information Retrieval (IR) the concept denoted by a query expression should be entailed from relevant retrieved documents. In multi-document summarization a redundant sentence or expression, to be omitted from the summary, should be entailed from other expressions in the summary. In Information Extraction (IE) entailment holds between different text variants that express the same target relation. And in Machine Translation evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations have to entail each other. Thus, modeling textual entailment may consolidate and promote broad research on applied semantic inference.

Task Definition

Participants in the evaluation exercise are provided with pairs of small text snippets (one or more sentences in English), which we term Text-Hypothesis (T-H) pairs. Examples were manually tagged for entailment (i.e. whether T entails H or not) by human annotators and will be divided into a Development Set, containing 800 pairs, and a Test Set, containing 800 pairs. Participating systems will have to decide for each T-H pair whether T indeed entails H or not, and results will be compared to the manual gold standard.
The goal of the RTE challenges is to provide opportunities for presenting and comparing possible approaches for modeling textual entailment. In this spirit, we aim at an explorative rather than a competitive setting. While participant results will be reported there will not be an official ranking of systems. A development set is released first to provide typical examples of the different types of test examples. The test set will be released three weeks prior to the result submission date. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection from corpora or the Web) specifically for the linguistic constructs that are present in the test data, as long as the methodology is general and fully automated, and the cost of running the learning/acquisition procedure at full scale can be reasonably estimated.

Dataset Collection and Application Settings

The dataset of Text-Hypothesis pairs was collected by human annotators. It consists of four subsets, which correspond to typical success and failure settings in different applications (as listed below). Within each application setting the annotators selected both positive entailment examples (annotated YES), where T does entail H, as well as negative examples (annotated NO), where entailment does not hold (50%-50% split). Some T-H examples appear in the Instructions section. H is a (usually short) single sentence, and T consists of one or more sentences, up to a short paragraph, to simulate an even more realistic scenario.

Information Retrieval (IR):

In this application setting, the hypotheses are propositional IR queries, which specify some statement, e.g. “Alzheimer’s disease is treated using drugs”. The hypotheses were adapted and simplified from standard IR evaluation datasets (TREC and CLEF). Texts (T) that do or do not entail the hypothesis were selected from documents retrieved by different search engines (e.g. Google, Yahoo and Microsoft) for each hypothesis. In this application setting it is assumed that relevant documents (from an IR perspective) must necessarily entail the hypothesis.

Multi-document summarization (SUM):

In this setting T and H are sentences taken from a news document cluster, a collection of news articles that describe the same news item. Annotators were given output of multi-document summarization systems, including the document clusters and the summary generated for each cluster. The annotators picked sentence pairs with high lexical overlap, preferably where at least one of the sentences was taken from the summary (this sentence usually played the role of T). For positive examples, the hypothesis was simplified by removing sentence parts, until it was fully entailed by T. Negative examples were simplified in the same manner. This process simulates the need of a summarization system to identify information redundancy, which should be avoided in the summary.

Question Answering (QA):

Annotators were given questions and the corresponding answers returned by QA systems. Transforming a question-answer pair into a text-hypothesis pair consisted of the following stages: First, the annotators picked from the answer passage an answer term of the expected answer type, either a correct or an incorrect one. Then, the annotators turned the question into an affirmative sentence with the answer term “plugged in”. These affirmative sentences serve as the hypotheses (H), and the original answer passage serves as the text (T). For example, given the question “Who is Ariel Sharon?” and an answer text “Israel’s Prime Minister, Ariel Sharon, visited Prague” (T), the question is turned into the statement “Ariel Sharon is the Israeli Prime Minister” (H), producing a positive entailment pair. This process simulates the need of a QA system to verify that the retrieved passage text indeed entails the provided answer.

Information Extraction (IE):

This task is inspired by the Information Extraction (and Relation Extraction) application, adapting the setting to having pairs of texts rather than a text and a structured template. The pairs were generated using different approaches. In the first approach, ACE-2006 relations (the relations proposed in the ACE-2007 RDR task) were taken as templates for hypotheses. Relevant news articles were then collected as texts (T) and corresponding hypotheses were generated manually based on the ACE templates and slot fillers taken from the text. For example, given the ACE relation ‘X work for Y’ and the text “An Afghan interpreter, employed by the United States, was also wounded.” (T), a hypothesis “An interpreter worked for Afghanistan.” is created, producing a non-entailing (negative) pair. In the second approach, the MUC-4 annotated dataset was similarly used to create entailing pairs. In the third approach, the outputs of actual IE systems were used to generate entailing and non-entailing pairs. In the fourth approach, new types of hypotheses that may correspond to typical IE relations were manually generated for different sentences in the collected news articles. These processes simulate the need of IE systems to recognize that the given text indeed entails the semantic relation that is expected to hold between the candidate template slot fillers.

Task Definition

We consider an applied notion of textual entailment, defined as a directional relation between two text fragments, termed T the entailing text, and H the entailed text. We say that T entails H if, typically, a human reading T would infer that H is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge. Textual entailment recognition is the task of deciding, given T and H, whether T entails H.

ID | TEXT | HYPOTHESIS | TASK | ENTAILMENT
1 | The drugs that slow down or halt Alzheimer’s disease work best the earlier you administer them. | Alzheimer’s disease is treated using drugs. | IR | YES
2 | Drew Walker, NHS Tayside’s public health director, said: “It is important to stress that this is not a confirmed case of rabies.” | A case of rabies was confirmed. | IR | NO
3 | Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England’s Liverpool Airport as Liverpool John Lennon Airport. | Yoko Ono is John Lennon’s widow. | QA | YES
4 | Arabic, for example, is used densely across North Africa and from the Eastern Mediterranean to the Philippines, as the key language of the Arab world and the primary vehicle of Islam. | Arabic is the primary language of the Philippines. | QA | NO
5 | About two weeks before the trial started, I was in Shapiro’s office in Century City. | Shapiro works in Century City. | QA | YES
6 | Meanwhile, in his first interview to a Western print publication since his election as president of Iran earlier this year, Ahmadinejad attacked the “threat” to bring the issue of Iran’s nuclear activity to the UN Security Council by the US, France, Britain and Germany. | Ahmadinejad is a citizen of Iran. | IE | YES
7 | The flights begin at San Diego’s Lindbergh Field in April, 2002 and follow the Lone Eagle’s 1927 flight plan to St. Louis, New York, and Paris. | Lindbergh began his flight from Paris to New York in 2002. | QA | NO
8 | The world will never forget the epic flight of Charles Lindbergh across the Atlantic from New York to Paris in May 1927, a feat still regarded as one of the greatest in aviation history. | Lindbergh began his flight from New York to Paris in 1927. | QA | YES
9 | Medical science indicates increased risks of tumors, cancer, genetic damage and other health problems from the use of cell phones. | Cell phones pose health risks. | IR | YES
10 | The available scientific reports do not show that any health problems are associated with the use of wireless phones. | Cell phones pose health risks. | IR | NO

Table 1: Example T-H pairs

Some additional judgment criteria and guidelines are listed below (examples are taken from Table 1):

  • Entailment is a directional relation. The hypothesis must be entailed from the given text, but the text need not be entailed from the hypothesis.
  • The hypothesis must be fully entailed by the text. Judgment would be “NO” if the hypothesis includes parts that cannot be inferred from the text.
  • Cases in which inference is very probable (but not completely certain) are judged as “YES”. In example #5 one could claim that although Shapiro’s office is in Century City, he actually never arrives at his office and works elsewhere. However, this interpretation of T is very unlikely, and so the entailment holds with high probability. On the other hand, annotators were guided to avoid vague examples for which inference has some positive probability which is not clearly very high.
  • Our definition of entailment allows presupposition of common knowledge, such as: a company has a CEO, a CEO is an employee of the company, an employee is a person, etc. For instance, in example #6, the entailment depends on knowing that the president of a country is also a citizen of that country.

Data Sets and Format

Both Development and Test sets are formatted as XML files, as follows:

<pair id="id_num" entailment="YES|NO" task="IE|IR|QA|SUM" length="short|long">

<t>the text…</t>

<h>the hypothesis…</h>

</pair>

Where:

  • each T-H pair appears within a single <pair> element.
  • the element <pair> has the following attributes:
    • id, a unique numeric identifier of the T-H pair.
    • task, the acronym of the application setting from which the pair has been generated (see introduction): “IR”, “IE”, “QA” or “SUM”.
    • entailment (in the development set only), the gold standard entailment annotation, being either “YES” or “NO”.
    • length, either “short” or “long”, where “long” indicates that T is a longer snippet.
  • the element <t> (text) has no attributes, and it may be made up of one or more sentences.
  • the element <h> (hypothesis) has no attributes, and it usually contains one simple sentence.
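
As an illustration of the format described above, the following is a minimal sketch of how the pairs might be read in Python with the standard library. The root element name (entailment-corpus) and the function name read_pairs are assumptions made for this example, since only the <pair> element is specified here; the two sample pairs are taken from Table 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file content. The root element name is an assumption:
# the format description only specifies the <pair> element.
SAMPLE = """<entailment-corpus>
<pair id="1" entailment="YES" task="IR" length="short">
<t>The drugs that slow down or halt Alzheimer's disease work best the earlier you administer them.</t>
<h>Alzheimer's disease is treated using drugs.</h>
</pair>
<pair id="2" entailment="NO" task="IR" length="short">
<t>Drew Walker, NHS Tayside's public health director, said: "It is important to stress that this is not a confirmed case of rabies."</t>
<h>A case of rabies was confirmed.</h>
</pair>
</entailment-corpus>"""


def read_pairs(xml_text):
    """Return one dict per <pair> element found in the XML text."""
    root = ET.fromstring(xml_text)
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "task": pair.get("task"),
            "length": pair.get("length"),
            # The entailment attribute is absent in the test set, so this may be None.
            "entailment": pair.get("entailment"),
            "text": (pair.findtext("t") or "").strip(),
            "hypothesis": (pair.findtext("h") or "").strip(),
        })
    return pairs


pairs = read_pairs(SAMPLE)
for p in pairs:
    print(p["id"], p["task"], p["entailment"], "-", p["hypothesis"])
```

Reading the entailment attribute with get() rather than direct indexing lets the same function handle test-set files, where that attribute is not present.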

The data is split into a development set and a test set, to be released separately. The goal of the development set is to guide the development and tuning of participating systems. Notice that since the given task has an unsupervised nature it is not expected that the development set can be used as a main resource for supervised training, given its anecdotal coverage. Rather it is typically assumed that systems will be using generic techniques and resources.

Result Submission

Systems must tag each T-H pair as either “YES”, predicting that entailment does hold for the pair, or as “NO” otherwise. No partial submissions are allowed, i.e. the submission must cover the whole dataset.

Systems should be developed based on the development data set. Analyses of the test set (either manual or automatic) should not impact in any way the design and tuning of systems that publish their results on the RTE-3 test set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs that will be present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. In any case, participants are asked to report any process that was performed specifically for the test set.

Evaluation Measures

The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.
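
The accuracy computation described above can be sketched as follows. This is only an illustration, not the organisers' scoring script: the function name accuracy and the dictionary-based representation of judgments (pair id mapped to “YES” or “NO”) are assumptions for this example.

```python
def accuracy(system, gold):
    """Fraction of pairs whose system judgment matches the gold standard.

    Both arguments map a pair id to "YES" or "NO". A pair missing from
    the system output simply counts as a mismatch in this sketch.
    """
    matches = sum(1 for pair_id, label in gold.items()
                  if system.get(pair_id) == label)
    return matches / len(gold)


# Hypothetical judgments for four pairs; the system errs on pair "2".
gold = {"1": "YES", "2": "NO", "3": "YES", "4": "NO"}
system = {"1": "YES", "2": "YES", "3": "YES", "4": "NO"}
acc = accuracy(system, gold)
print(acc)  # 3 of 4 judgments match: 0.75
```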

As a second measure, an Average Precision measure will be computed. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence (in decreasing order from the most certain entailment to the least certain). The more the system is confident that T entails H, the higher the ranking is. A perfect ranking would place all the positive pairs (for which the entailment holds) before all the negative pairs. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system’s precision values at all points in the ranked list in which recall increases, that is at all points in the ranked list for which the gold standard annotation is YES. More formally, it can be written as follows:

$$\frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\#\text{-entailing-up-to-pair-}i}{i}$$

where n is the number of the pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs, ordered by their ranking. This score will be computed for systems that will provide as output a confidence-ranked list of all test examples (in addition to the YES/NO output for each example).
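
The average precision formula above can be sketched in a few lines of Python. The function name and the input representation (the gold labels listed in the order of the system's ranking) are assumptions for this example, as is the choice to return 0 when the list contains no positive pairs, a case the definition above does not cover.

```python
def average_precision(gold_in_ranked_order):
    """Average precision of a ranking.

    gold_in_ranked_order: gold labels ("YES"/"NO") of the pairs, listed in
    the system's order of decreasing entailment confidence.
    """
    total_positive = sum(1 for g in gold_in_ranked_order if g == "YES")  # R
    if total_positive == 0:
        return 0.0  # assumption: the formula is undefined for R = 0
    positives_so_far = 0
    precision_sum = 0.0
    for i, g in enumerate(gold_in_ranked_order, start=1):
        if g == "YES":  # E(i) = 1, so this position contributes
            positives_so_far += 1
            precision_sum += positives_so_far / i  # precision at rank i
    return precision_sum / total_positive


# Hypothetical ranking with one negative pair placed above a positive one:
# precision is 1/1 at rank 1 and 2/3 at rank 3, so AP = (1 + 2/3) / 2.
ap = average_precision(["YES", "NO", "YES", "NO"])
perfect = average_precision(["YES", "YES", "NO", "NO"])
print(ap, perfect)
```

As the second call shows, a ranking that places all positive pairs before all negative pairs scores 1.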

Result Submission Format

Results will be submitted in a file with one line for each T-H pair in the test set, in the following format:

pair_id<blank space>judgment

where

  • pair_id is the unique identifier of each T-H pair, as it appears in the test set.
  • judgment is either “YES” or “NO”.

The first line in the file should look as follows:

ranked:<blank space>"YES|NO"

The first line indicates whether the submission includes confidence ranking of the pairs (see evaluation measures above). Average precision will be computed only for systems that specify “ranked: YES” in the first line. If the submission includes confidence ranking, the pairs in the file should be ordered by decreasing entailment confidence order: the first pair should be the one for which the entailment is most certain, and the last pair should be the one for which the entailment is least likely (i.e. the one for which the judgment as “NO” is the most certain). Thus, in a ranked list all the positively classified pairs are expected to appear before all those that are classified as negative.
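
A minimal sketch of producing a file body in this format is given below. The per-pair confidence scores, the tuple layout and the function names are all hypothetical; the format itself only requires the header line, one “pair_id judgment” line per pair, and (for ranked submissions) decreasing entailment confidence order.

```python
def order_by_confidence(scored):
    """Sort (pair_id, judgment, confidence) tuples by decreasing confidence
    that T entails H. The confidence score is an assumed system output."""
    return sorted(scored, key=lambda item: item[2], reverse=True)


def format_submission(scored, ranked=True):
    """Build the text of a result file: a 'ranked:' header line followed by
    one 'pair_id judgment' line per pair."""
    rows = order_by_confidence(scored) if ranked else scored
    lines = ["ranked: " + ("YES" if ranked else "NO")]
    for pair_id, judgment, _confidence in rows:
        if judgment not in ("YES", "NO"):
            raise ValueError(f"unexpected judgment for pair {pair_id}: {judgment}")
        lines.append(f"{pair_id} {judgment}")
    return "\n".join(lines) + "\n"


# Hypothetical system output for three pairs.
scored = [("1", "YES", 0.9), ("2", "NO", 0.2), ("3", "YES", 0.7)]
submission = format_submission(scored, ranked=True)
print(submission, end="")
```

With these example scores, the pairs are written in the order 1, 3, 2, so both pairs judged “YES” precede the pair judged “NO”.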

Each submitted run must be a plain text file, with a filename composed of a unique 5-6 character alphanumeric string and the number of the run, separated by “-”, e.g. CLCT07-run1.txt. Participating teams will be allowed to submit 2 result files per system. The corresponding result files should be named XXXXX-run1.txt, and XXXXX-run2.txt if a second run is submitted for the same system.

The results files should be zipped and submitted via email to infocelct at itc . it, with the subject line “[RTE3 RESULT SUBMISSION]”.

System Reports

Participants are requested to submit a full system report by March 26, 2007. It should be noted that the change of schedule for paper submissions, from April 2 to March 26, 2007, was due to the merging of the RTE workshop with the Paraphrasing Workshop, which will result in unifying the reviewing process of the two types of papers (system and technical) and make it more competitive for RTE system reports. As the schedule is quite tight, we suggest preparing a draft of the paper in advance of receiving results of the system evaluation. Report submissions will follow the same procedure for article submissions as for the main workshop (using the START system). Report submissions must be uploaded by filling out the submission form at the following URL: www.softconf.com/acl07/ACL07-WS9/submit.html. Please remember to select the “RTE3 paper” option in the Submission category field.

Reports should include up to 6 double column pages, using the ACL Style files and guidelines. As the reports presented at the workshop are expected to be very informative, in order to further explore entailment phenomena and any alternative methods of approaching it, we suggest an analysis of results and a presentation of any general observations you might have about the entailment recognition problem, in addition to the straightforward system description.

Due to workshop time limitations this year, not all system reports will be presented orally. The reviewing of RTE3 system reports will be blind and the papers should not include the authors’ names and affiliations. The best reports will be presented orally, while the remaining papers which pass a sufficient level of quality will be presented as posters. Reviews with comments for the camera ready version and decisions about presentation in the workshop will be sent back to the authors by April 23, 2007.

The camera ready version of each report, to be included in the workshop proceedings, should be submitted in pdf format (with no page numbers) by May 6, 2007.

Evaluation

The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.

As a second measure, an Average Precision measure will be computed. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence (in decreasing order from the most certain entailment to the least certain). The more the system is confident that T entails H, the higher the ranking is. A perfect ranking would place all the positive pairs (for which the entailment holds) before all the negative pairs. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system’s precision values at all points in the ranked list in which recall increases, that is at all points in the ranked list for which the gold standard annotation is YES (Voorhees and Harman, 1999). More formally, it can be written as follows:
$$\frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\#\text{-correct-up-to-pair-}i}{i}$$

where n is the number of the pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs, ordered by their ranking.

Note the difference between this measure and the Confidence Weighted Score used in the first challenge.

This score will be computed for systems that will provide as output a confidence-ranked list of all test examples (in addition to the YES/NO output for each example).

Organisers

Danilo Giampiccolo, CELCT, Trento
Ido Dagan, Bar Ilan University
Bill Dolan, Microsoft Research
Bernardo Magnini, ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, Povo (TN), Italy