Letter-to-Phoneme Conversion Challenge | Knowledge 4 All Foundation Ltd.

Letter-to-phoneme conversion is a classic problem in machine learning (ML), as it is both hard (at least for languages like English and French) and important. For non-linguists, a ‘phoneme’ is an abstract unit corresponding to the equivalence class of physical sounds that ‘represent’ the same speech sound. That is, members of the equivalence class are perceived by a speaker of the language as the ‘same’ phonemes: the word ‘cat’ consists of three phonemes, two of which are shared with the word ‘bat’. A phoneme is defined by its role in distinguishing word pairs like ‘bat’ and ‘cat’. Thus, /b/ and /k/ are different phonemes. But the /b/ in ‘bat’ and the /b/ in ‘tab’ are the same phoneme, in spite of their different acoustic realisations, because the difference between them is never used (in English) to signal a difference between minimally-distinctive word-pairs.

The problem is important because letter-to-sound conversion is central to the technology of speech synthesis, where input text has to be transformed to a representation that can drive the synthesis hardware, and necessary for some aspects of speech recognition. It is usual to employ phonemic symbols as the basis of this representation. However, letter-to-sound conversion is not a single mapping problem but a class of problems, which include not just automatic pronunciation but stress assignment, letter-phoneme alignment, syllabification and/or morphemic decomposition, and so on, hence the PRONALSYL acronym. Although we intend to give most prominence to letter-to-phoneme conversion, the community is challenged to develop and submit innovative solutions to these related problems.

As the specifics of the letter-to-sound problem vary from language to language, we intend that participants try their algorithms on a variety of languages. To this end, we will be making available different dictionaries covering a range of languages. They will minimally give a list of word spellings and their corresponding pronunciations. Be warned that the different dictionaries will typically use different conventions for representing the phonemes of the relevant language; this is all part of the fun. If participants have public-domain dictionaries of other interesting languages that they are willing to donate to the PRONALSYL challenge, we will be very pleased indeed to receive them. Please contact one of the organisers.

Virtually all existing letter-to-phoneme conversion methods require the letters of the word spelling and the phonemes of the pronunciation to be aligned in one-to-one fashion, as a bijection. This converts the string transcription problem to a classification problem. We will pre-align all the dictionaries using our EM-based algorithm before making them available to PRONALSYL participants. We also intend to make available a self-service alignment facility, so that researchers can submit their dictionaries, align them and have the results sent back by email. PLEASE WATCH THIS SPACE.

We also hope to make a couple of representative learning algorithms available for participants to use as benchmarks for quick initial assessment of their own algorithms. One of these will be pronunciation by analogy (PbA); the other will probably be a well-known rule induction algorithm. I am negotiating with the owner of the latter to let us use it.

Finally, not everyone is convinced that machine learning is the right way to approach this problem. In particular, there has been a long tradition of expert linguists writing rules manually. These rules are intended to encode the expert’s knowledge of spelling-to-sound regularities in the language of interest. We are very keen for participants both to donate their own rules for comparison with ML methods, and/or to report on such comparisons. An especially interesting issue is whether or not the relative advantages and disadvantages of rules versus ML approaches vary systematically across languages according to some measure of the complexity of the writing system.

The timetable for the challenge is as follows:

February 2006	Challenge goes live
10-12 April 2006	Preliminary reporting at Pascal workshop in Venice
January 2007	Challenge closes

The timescale is rather longer than most Pascal challenges, not least because our principal motivation is to produce the best possible result for letter-to-sound conversion rather than to conduct a prize competition. We want to give participants every chance to achieve good performance without being unduly worried about a level playing field.

Organising Committee:

Antal van den Bosch (Tilburg University, The Netherlands)
Stanley Chen (IBM T. J. Watson Research Center, USA)
Walter Daelemans (University of Antwerp, Belgium)
Bob Damper (University of Southampton, UK, Co-Chair)
Kjell Gustafson (Acapela Group and KTH, Sweden)
Yannick Marchand (National Research Council Canada, Co-Chair)
Francois Yvon (ENST, France)

Introduction to PRONALSYL

Ever since Sejnowski and Rosenberg’s pioneering work on NETtalk in 1987, the automatic conversion of English text to a phonemic representation of that text has been a benchmark problem in machine learning (ML). Not only is the problem very hard (certainly for languages like English and French), it is also of practical importance, being central to the technology of text-to-speech (TTS) synthesis and highly useful for some aspects of speech recognition. Although Sejnowski and Rosenberg’s work was not the first attempt at applying ML to text-to-phoneme conversion (cf. Oakey and Cawthorne 1981; Lucassen and Mercer 1984; etc.), it was instrumental in focusing wider attention on the problem, as well as providing a publicly accessible database for training and test, in the form of 20,008 words taken from Webster’s dictionary.

In spite of the passage of time since 1987, this problem remains worthy of very serious attention. For many years, the majority opinion in the speech research community has been that NETtalk (and other ML approaches) did not perform anywhere near as well as letter-to-sound rules manually written by an expert linguist or phonetician. Opinion on this point, however, has been slowly changing since careful performance comparisons between rules and ML approaches (like that of Damper et al. 1999) started to appear. Even so, the best-performing method in Damper et al.’s study, namely pronunciation by analogy (Dedina and Nusbaum 1991; Damper and Eastmond 1997; Marchand and Damper 2000), still only achieved 71.6% words correct on a rather small test set (by current standards) of just over 16,000 words, indicating that this is not a solved problem. Further, the work of just one group could only hope to sample the enormous range of possible ML approaches, leaving many others unexplored. Among these are techniques like hidden Markov models, weighted finite state transducers, support vector machines and other kernel learning methods, etc. A major motivation of the PRONALSYL challenge is to widen the scope of comparative evaluation to include as many such approaches as possible.

Most machine learning approaches to automatic pronunciation, including NETtalk-like neural networks and PbA, require letters and phonemes to have been previously aligned in one-to-one fashion (i.e., a bijection), so as to convert the problem of string transduction to one of classification. Although work to date in this field has generally used manually-aligned data, this becomes impractical as we move to larger dictionaries and seek to migrate text-to-speech technology to a wider range of languages. However, automatic alignment turns out to be a very hard problem, so there is an urgent need for improved methods to use in conjunction with automatic pronunciation. For this reason, the PRONALSYL challenge covers not just pronunciation, but related problems of alignment, and other processes that affect pronunciation (what van den Bosch 1997 calls “word-pronunciation subtasks”) such as stress assignment and syllabification.
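To make the reduction from string transduction to classification concrete, here is a minimal sketch in Python. It assumes a fixed window of neighbouring letters around each position, in the spirit of NETtalk-like systems; the padding symbol `#`, the null phoneme `-`, the phoneme codes and the function name `make_instances` are all hypothetical choices for illustration and are not taken from the challenge data.

```python
from typing import List, Tuple

PAD = "#"  # hypothetical padding symbol for positions outside the word


def make_instances(
    letters: List[str], phonemes: List[str], context: int = 3
) -> List[Tuple[Tuple[str, ...], str]]:
    """Turn one aligned word into (letter window, phoneme) classification instances."""
    if len(letters) != len(phonemes):
        raise ValueError("aligned letters and phonemes must have equal length")
    padded = [PAD] * context + list(letters) + [PAD] * context
    instances = []
    for i, target in enumerate(phonemes):
        # The window is centred on letter i, with `context` letters on each side.
        window = tuple(padded[i : i + 2 * context + 1])
        instances.append((window, target))
    return instances


# Toy aligned entry: "-" stands in for a null phoneme aligned with the silent 'e'.
example = make_instances(list("cake"), ["k", "eI", "k", "-"], context=1)
for window, target in example:
    print(window, "->", target)
```

Each instance can then be passed to any standard classifier. A method that avoids explicit alignment would need a different formulation, since the one-instance-per-letter view depends on the two strings having equal length.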

Concerning the latter sub-task, we have recently shown that (not surprisingly!) separating the input text into syllables eases the problem of pronunciation (Marchand and Damper 2006). At this stage, however, the syllabification has to be done manually to achieve a performance gain. We have not yet found an automatic syllabification algorithm that improves PbA relative to our standard model that does not consider syllable structure.

Against this background, the major problems addressed by the challenge are:

To improve on the best automatic pronunciation result(s) so far achieved, either by use of a novel ML approach or by improving an existing one.
To do this for a range of languages and language resources, both widely-spoken (e.g., French, English, German) and minority (Frisian), as well as covering the continuum of deep to shallow orthographies (e.g., from French and English through to Spanish and Serbian).
In view of the problem of alignment, EITHER to improve existing alignment methods OR to demonstrate superior performance on an ML method that does not require explicit alignment.
In view of the importance of syllabification (and the related issue of morphemic decomposition), to demonstrate improved syllabification and/or hyphenation algorithms leading to superior pronunciation performance.

As learning pronunciation is an ill-posed problem, because of problems such as heterophonous homographs, variability of pronunciation, etc., appropriate use of prior knowledge to constrain solutions is important. Further, it is not always clear what the exact form of the output should be (segmental string, with/without stress markers, with/without syllabification). We therefore strongly encourage contributions that address issues such as:

How can ML techniques take advantage of additional input such as part of speech, or morphological decomposition?
ML techniques for inducing the right level of ambiguity (e.g., multiple output pronunciations)
What is the right strategy for pronunciation induction? Should we find the pronunciation first, followed (say) by stress then syllabification, or find the syllabification first followed by …, or should we do them all in parallel? (Antal van den Bosch’s PhD thesis describes some important work on this latter topic as does the Marchand and Damper syllabification paper.)

References

van den Bosch, A. P. J. (1997). Learning to Pronounce Written Words: A Study in Inductive Language Learning, PhD Thesis, University of Maastricht, The Netherlands.

Damper, R. I. and J. F. G. Eastmond (1997). Pronunciation by analogy: Impact of implementational choices on performance. Language and Speech 40(1), 1-23.

Damper, R. I., Y. Marchand, M. J. Adamson, and K. Gustafson (1999). Evaluating the pronunciation component of text-to-speech systems for English: A performance comparison of different approaches. Computer Speech and Language 13(2), 155-176.

Dedina, M. J. and H. C. Nusbaum (1991). PRONOUNCE: A program for pronunciation by analogy. Computer Speech and Language 5(1), 55-64.

Lucassen, J. M. and R. L. Mercer (1984). An information theoretic approach to the automatic determination of phonemic baseforms. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’84, San Diego, CA, pp. 42.5.1-42.5.4.

Marchand, Y. and R. I. Damper (2000). A multistrategy approach to improving pronunciation by analogy. Computational Linguistics 26(2), 195-219.

Marchand, Y. and R. I. Damper (2006). Can syllabification improve pronunciation by analogy? Natural Language Engineering, in press.

Oakey, S. and R. C. Cawthorne (1981). Inductive learning of pronunciation rules by hypothesis testing and correction. In Proceedings of International Joint Conference on Artificial Intelligence, IJCAI-81, Vancouver, Canada, pp. 109-114.

Sejnowski, T. J. and C. R. Rosenberg (1987). Parallel networks that learn to pronounce English text. Complex Systems 1(1), 145-168.

Datasets: Dictionary File Download

Please select a language to show the dictionaries available in that language. Note that some have an associated README file; this will be indicated to you as a separate download. The resources available here are strictly for research use only. More dictionaries should become available over time.

The dictionaries have a variety of different formats. That is, some have space separation of symbols, some use two characters for phoneme codes, and so on. This is an unavoidable nuisance; we think it should be possible for Challenge participants to figure out the various formats without too much effort.

Final evaluation will be done using 10-fold cross validation (see under Evaluation tab). To this end, each dictionary has been partitioned into 10 ‘folds’. It is required that everyone use the same folds to allow proper comparison of results. Note that (for technical reasons that I won’t go into) not all folds have exactly the same number of words. Do not worry about this; just carry on regardless.

We have supplied a simple pronunciation by analogy (PbA) program to act as a benchmark, giving you something to compare your method against. We also hope to add some other benchmarks (e.g., a decision tree method). We would very much appreciate offers from participants to supply other benchmarks, such as a naive Bayes classifier.

Instructions

Although not essential, it would be a great help to the organisers if you initially signalled your intention to participate by emailing Bob Damper at rid@ecs.soton.ac.uk.

Decide which language or languages you would like to work with and download the relevant dictionaries from the Datasets tab. You are especially welcome to contribute resources for new languages; please contact the organisers about this. For such additional languages, it is necessary that you clear the copyright position with the owners of the data.

As most methods of letter-to-phoneme conversion require datasets in which each letter of every word is aligned with its corresponding phoneme, we have pre-aligned the dictionaries using the EM algorithm. If your method does not require alignment (good for you if it doesn’t!), or if you want to take up the challenge of producing a good automatic aligner, you can easily pre-process the dictionary to produce an unaligned version by stripping out all the null characters. Please note that there are null letters as well as null phonemes. If your method does require alignment, you will have to devise some way of handling null letters in the input.

For letter-to-phoneme conversion, the intention is to use the dictionary itself as the gold standard for evaluation of conversion performance. If you choose to focus on alignment and/or syllabification, the very significant problem of finding an acceptable gold standard arises. Our intention is that alignment and/or syllabification are primarily evaluated by the contribution that they make to improved (hopefully!) letter-to-phoneme conversion.
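Producing an unaligned version by stripping nulls might look like the following Python sketch. The null symbol `-`, the phoneme codes and the function name `unalign` are assumptions made for illustration; the actual null character and symbol separation differ between dictionaries, so check each dictionary's README before adapting it.

```python
from typing import List, Tuple

NULL = "-"  # hypothetical null symbol; the real one depends on the dictionary


def unalign(
    letters: List[str], phonemes: List[str], null: str = NULL
) -> Tuple[List[str], List[str]]:
    """Drop null letters and null phonemes to recover the raw spelling and pronunciation."""
    spelling = [s for s in letters if s != null]
    pronunciation = [p for p in phonemes if p != null]
    return spelling, pronunciation


# A null letter pads the spelling where one letter ('x') maps to two phonemes.
six = unalign(["s", "i", "x", "-"], ["s", "I", "k", "s"])
# A null phoneme pads the pronunciation where a letter (final 'e') is silent.
cake = unalign(["c", "a", "k", "e"], ["k", "eI", "k", "-"])
print(six)
print(cake)
```

After stripping, the two sequences generally differ in length, which is exactly the situation an alignment-free method or a new aligner has to cope with.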

Evaluation

For letter-to-phoneme conversion, evaluation will be by 10-fold cross validation. Each of the available datasets is already split into 10 subsets (‘folds’) so that everyone is using the same ones. This will promote comparability of results. Take each fold in turn, train on the remaining nine folds and test on that particular fold. Repeat this 10 times, so that each of the 10 folds is used exactly once as the test set, and report the mean and standard deviation of the phoneme accuracy (and, if you wish, word accuracy also) computed over the 10 trials.
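The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the folds are taken to be already loaded as lists of aligned (letters, phonemes) pairs, `train` is a hypothetical callable that returns a prediction function, phoneme accuracy is scored position by position (which presumes the predicted string has the same length as the aligned gold string), and the sample standard deviation is used because the challenge text does not say which variant is intended.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

Entry = Tuple[Tuple[str, ...], Tuple[str, ...]]  # (letters, aligned phonemes)
Model = Callable[[Tuple[str, ...]], Tuple[str, ...]]


def cross_validate(
    folds: Sequence[List[Entry]], train: Callable[[List[Entry]], Model]
) -> Dict[str, Tuple[float, float]]:
    """Hold out each fold once; return (mean, std) of phoneme and word accuracy."""
    phoneme_acc, word_acc = [], []
    for k, test_fold in enumerate(folds):
        # Train on every fold except the one currently held out.
        training = [e for j, fold in enumerate(folds) if j != k for e in fold]
        model = train(training)
        right = total = words_right = 0
        for letters, gold in test_fold:
            predicted = tuple(model(letters))
            right += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
            words_right += predicted == tuple(gold)
        phoneme_acc.append(right / total)
        word_acc.append(words_right / len(test_fold))
    return {
        "phoneme": (statistics.mean(phoneme_acc), statistics.stdev(phoneme_acc)),
        "word": (statistics.mean(word_acc), statistics.stdev(word_acc)),
    }


# Toy check with two tiny folds and a dummy "model" that echoes the letters.
toy_folds = [[(("a", "b"), ("a", "b"))], [(("a", "b"), ("a", "x"))]]
result = cross_validate(toy_folds, lambda data: (lambda letters: tuple(letters)))
print(result)
```

With the supplied dictionaries, `folds` would hold the 10 predefined partitions rather than a random split, so that results remain comparable across participants.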

For other tasks (alignment, syllabification, stress assignment, …), you should follow the above procedure as far as possible.