Morpho Challenge 2010 – Semi-supervised and Unsupervised Analysis

The Challenge results are now available as well as the Challenge workshop program and slides.

The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.

The scientific goals are:

To learn of the phenomena underlying word construction in natural languages
To discover approaches suitable for a wide range of languages
To advance machine learning methodology

Morpho Challenge 2010 is a follow-up to our previous Morpho Challenge 2005, 2007, 2008 and 2009. The task in 2010 is similar to 2009, where the aim was to find the morpheme analysis of the word forms in the data. As a new task we will provide a possibility for semi-supervised learning using the available linguistic gold standard morpheme analysis.

Participation in the previous challenges is by no means a prerequisite for participation in Morpho Challenge 2010. Everyone is welcome and we hope to attract many participating teams. The results will be presented in a workshop. Please read the rules and see the schedule. The datasets are available for download. Submit your analyses (result files) by sending them by email to the organizers, or by indicating a location where the organizers can download your files. Remember also to describe your algorithm in a paper. Please read the formatting instructions in rules.

If you plan to participate in Morpho Challenge, please contact the organizers using the email address in contact and ask to be added in our mailing list. We will use this mailing list to provide news about the tasks, data and evaluations.

We are looking forward to an interesting challenge!

Mikko Kurimo, Krista Lagus, Sami Virpioja and Ville Turunen
Adaptive Informatics Research Centre, Aalto University School of Science and Technology (previously known as Helsinki University of Technology)
The organizers

References

Mathias Creutz and Krista Lagus (2005). Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0. Publications in Computer and Information Science, Report A81, Helsinki University of Technology, March.
[ Article (PDF) ]Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Janne Pylkkönen, and Sami Virpioja (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish.Computer Speech and Language, Volume 20, Issue 4, October, pages 515–541.
[ Article (PDF) ]

Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi (2007). Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Proceedings of the Machine Translation Summit XI, pages 491–498.
[ Article (PDF) ]

Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. Overview and results of Morpho Challenge 2009. In Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 2009.
[ Article (PDF) ]

Rules

Submission of large data files

Send an email to the organizers morphochallenge2010mail.cis.hut.fi and tell where they can download the data files. Small data files (but not larger than a few MBs) can be emailed directly. Please, follow carefully the format of the result files described in datasets.

Acceptance

The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.

Eligibility

Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which of the three will be accepted.

Test languages

Data sets are provided for four languages: English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.

(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)

Task

The task is to develop a system that conducts unsupervised morpheme analysis for every word form contained in a word list supplied by the organizers for each test language.

The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.

New in 2010: A new category for semi-supervised learning algorithms using the available linguistic gold standard morpheme analysis. The abstracts submitted by the participants must contain clear descriptions of which steps of supervision or parameter optimization are involved in the algorithms.

Participants are also allowed to use additional text corpora from any source for training their algorithms.

Competitions

The segmentations will be evaluated in three complementary ways:

Competition 1: The proposed morpheme analyses will be compared to a linguistic “gold standard”.

Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.

Competition 3: Machine Translation (MT) model is trained, where the words in the source language documents will be replaced by their proposed morpheme representations. The words in the source language evaluation data will then also be replaced by their proposed morpheme representations and the translation will be based on morphemes instead of words.

Competition 1 will include all four test languages. Winners will be selected separately for each language and category (unsupervised or semi-supervised). As a performance measure, the F-measure of accuracy of suggested morpheme analyses is utilized. Should two solutions produce the same F-measure, the one with higher precision will win.

Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.

Competition 3 will include two of the test languages. Translation will be done from the test language to English. The organizers will train the translation models and perform the evaluation of the translations using an automatic metric such as BLEU.

Workshop and publication

All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the organizers based on the results and 4-page extended abstract describing the algorithm and the data submission. However, all groups who have submitted results and a paper are welcome to participate in the workshop to listen to the talks and join the discussions.

Extended abtracts

For the extended abstract you can use the two-column ACL format. For templates and general formatting instructions, see ACL 2010 Instructions for Preparing Camera-Ready Papers. The length of the paper should be around 4 pages. Email your extended abstract to the organizers by July 16.

Arbitration

In the case of disagreement the organizers will decide the final interpretation of the rules.

Competition 1

In Competition 1, for each language, the morpheme analyses proposed by the participants’ algorithm will be compared against a linguistic gold standard. Since the task at hand involves unsupervised learning, it cannot be expected that the algorithm comes up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between labels as such (the labels in the proposed analyses vs. labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants’ algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words “foot” and “feet” both contain the morpheme “foot_N”. It is thus desirable that also the participants’ algorithm discovers a morpheme that occurs in both these word forms (be it called “FOOT”, “morpheme784”, “foot” or something else).

In practice, the evaluation will take place by sampling a large number of word pairs, such that both words in the pair have at least one morpheme in common. As the evaluation measure, we will use F-measure, which is the harmonic mean of Precision and Recall:

F-measure = 1/(1/Precision + 1/Recall).

Precision is here calculated as follows: A number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen from the result file by random (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really has a morpheme in common according to the gold standard. The maximum number of points for one sampled word is normalized to one. The total number of points is then divided by the total number of sampled words.

For instance, assume that the proposed analysis of the English word “abyss” is: “abys +s”. Two word pairs are formed: Say that “abyss” happens to share the morpheme “abys” with the word “abysses”; we thus obtain the word pair “abyss – abysses”. Also assume that “abyss” shares the morpheme “+s” with the word “mountains”; this produces the pair “abyss – mountains”. Now, according to the gold standard the correct analyses of these words are: “abyss_N”, “abyss_N +PL”, “mountain_N +PL”, respectively. The pair “abyss – abysses” is correct (common morpheme: “abyss_N”), but the pair “abyss – mountain” is incorrect (no morpheme in common). Precision for the word “abyss” is thus 1/2 = 50%.

Recall is calculated analogously to precision: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen from the gold standard by random (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that has a morpheme in common also in the analyses proposed by the participants’ algorithm. Points per word is normalized to one and the total number of points is divided by the total number of words.

For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, normalization of the points is carried out. In short, an equal weight is given for each alternative analysis, as well as each word pair in an analysis. E.g., if a word has three alternative analyses, the first analysis has four morphemes, and the first word pair in that analysis has two morphemes in common, each of the two common morphemes will amount to 1/3*1/4*1/2=1/24 of the one point available for that word.

Small sets of the gold standards used are available on the datasets page for semi-supervised learning and parameter validation. Neither the labeled training set nor the development test set contain any of the word forms used in the final test set.

Evaluation of a sample (development test set)

You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is the Perl script: eval_morphemes_v2.pl. The evaluation program is invoked as follows:

eval_morphemes_v2.pl [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile

Four files are given as arguments to eval_morphemes_v2.pl:

wordpairsfile_goldstd: this is the “random word pairs file” available for download on the datasets page. This file is needed in the calculation of an estimate of the recall of the proposed morpheme analyses.
wordpairsfile_result: this file has to be generated using another program (see below). It is needed in the calculation of a rough estimate of the precision of the proposed morpheme analyses.
goldstdfile:this is the sample of the gold standard available for download on the datasets page. This file contains the correct morpheme analyses.
resultfile: this is the result file that your algorithm produces, i.e., a list of words and their proposed morpheme analyses. Make sure that the file format is correct and does not include any additional whitespace.

The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:

PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76)
PART0. Recall:    25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345)
PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
#
TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73%
TOTAL. Recall:    25.59%; non-affixes: 49.78%; affixes: 10.78%
TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%

Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants’ algorithms.

The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., “+PL”, “+PAST”. This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.

Sampling word pairs for the calculation of an estimate of the precision

In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script eval_morphemes_v2.pl with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the gold standard labels that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the gold standard sample. There is still a problem that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.

Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:

cut -f1 goldstdfile > relevantwordsfile

Then sample word pairs for 300 words selected by random from your results file:

sample_word_pairs_v2.pl -n 300 -refwords relevantwordsfile < resultfile > wordpairsfile_result

You can use more than 300 words if you wish (the maximum is the amount of words in the gold standard file), or sample several word pair files with different random initializations (-rand switch), but remember that the estimate of precision that you will obtain is biased in any case, due to the limited amount of gold standard labels.

The necessary Perl program is sample_word_pairs_v2.pl. The output file wordpairsfile_result is used as input to eval_morphemes_v2.pl (see above).

Competition 2

Competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants in information retrieval experiments. Data from CLEF will be used.
However, those participants who wish to submit morpheme analysis for words in their actual context (competition 2b), please contact the organizers for more information how to register to CLEF to obtain the full texts.

In the competition 2 (and 2b) the words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the IR evaluation using the state-of-the-art Okapi (BM25) retrieval method (the latest version of the freely available LEMUR toolkit. The most common morphemes in each participant’s submission will be left out from the index. The size of this stoplist will be proportional to the amount of the text data in each language and the stoplist size will be the same for each participant’s submission. The evaluation criterion will be Uninterpolated Average Precision. The segmentation with the highest Average Precision will win. The winner is selected separately for competitions 2 and 2b in each language.

Competition 3

In competition 3, the morpheme analyses proposed by the participants’ algorithm will be evaluated in a statistical machine translation (SMT) framework. The translation models will be trained to translate from a morphologically complex source language to English. The words of the source language will be replaced by their morpheme analyses before training. The translations from this morpheme-to-word model will be combined with translations from a standard word-to-word translation model. For all models, we will use a state-of-the-art phrase-based SMT system. Evaluation of the translations will be performed by applying an automatic metric such as BLEU on a held-out test set.

Competition 3 does not require any extra effort by the participants. The organizers will use the analyses provided by the participants in machine translation experiments. Data from the Europarl corpus will be used. Those participants who wish to submit morpheme analysis for words in their actual context, please contact the organizers for information on how to get the full corpus.

References

Rules

Submission of large data files

Acceptance

Eligibility

Test languages

Task

Competitions

Workshop and publication

Extended abtracts

Arbitration

Competition 1

Evaluation of a sample (development test set)

Competition 2

Competition 3

Knowledge 4 All Foundation Ltd.