12.1755, Review: Melamed, Parallel Texts

The LINGUIST Network linguist at linguistlist.org
Fri Jul 6 17:29:52 UTC 2001


LINGUIST List:  Vol-12-1755. Fri Jul 6 2001. ISSN: 1068-4875.

Subject: 12.1755, Review: Melamed, Parallel Texts

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
            Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Editors (linguist at linguistlist.org):
	Karen Milligan, WSU 		Naomi Ogasawara, EMU
	Lydia Grebenyova, EMU		Jody Huellmantel, WSU
	James Yuells, WSU		Michael Appleby, EMU
	Marie Klopfenstein, WSU		Ljuba Veselinova, Stockholm U.
	Heather Taylor-Loring, EMU	Dina Kapetangianni, EMU

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.



Editor for this issue: Terence Langendoen <terry at linguistlist.org>
 ==========================================================================
What follows is another discussion note contributed to our Book Discussion
Forum.  We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion."  (This means that
the publisher has sent us a review copy.)  Then contact Simin Karimi at
     simin at linguistlist.org or Terry Langendoen at terry at linguistlist.org.


=================================Directory=================================

1)
Date:  Fri, 6 Jul 2001 12:32:01 -0400
From:  Mike Maxwell <Mike_Maxwell at sil.org>
Subject:  Review of Melamed "Empirical Methods for Exploiting Parallel Texts"

-------------------------------- Message 1 -------------------------------

Date:  Fri, 6 Jul 2001 12:32:01 -0400
From:  Mike Maxwell <Mike_Maxwell at sil.org>
Subject:  Review of Melamed "Empirical Methods for Exploiting Parallel Texts"

Review of Melamed, I. Dan (2001) Empirical Methods for
Exploiting Parallel Texts. MIT Press, xi + 195 pp.,
$32.95. (publisher's announcement in Linguist List 12.622)

Mike Maxwell, Summer Institute of Linguistics

Two questions. Suppose you had a text in some language,
and a translation of that text into another language: a
machine-readable Rosetta Stone. How close could a computer
come to finding corresponding paragraphs, sentences, and
words in the two texts, knowing nothing (or very little)
about the two languages to begin with? The answer is,
surprisingly close.

The second question: Why should linguists care? Since this
review is for Linguist List (not for a computational
linguistics mailing list), I will try to answer this
question before proceeding to the review itself. Melamed
suggests a number of uses for computer tools that find
correspondences, including bilingual lexicography
(particularly for newer terminology, which may not have
found its way into published dictionaries, and for new or
rare word senses); other bilingual resources for
translators, such as translation examples; aids for
students of foreign languages (when the student is reading
a text and gets stuck, the aligned translation is available
for reference); and a sort of proofreader for detecting
omissions in translated text (the subject of chapter four,
see below).

I might also add that interlinear text is widely used by
field linguists, and perhaps the sorts of alignment tools
discussed by Melamed could be adapted to do interlinear
glossing semi-automatically. There are obstacles, however,
not the least of which is that languages with substantial
inflectional morphology present special problems for
automatic alignment; more on this later.

Bilingual texts might also help with machine-driven
syntactic annotation, since different languages often have
differing patterns of syntactic ambiguity (as pointed out by
Matsumoto and Utsuro 2000: 582, footnote 5); this is
largely unexplored, and Melamed does not comment on it.

Finally, it might be possible to create machine translation
tools more or less automatically from aligned bilingual
text; this is touched on in chapters seven through nine,
although the reality of what can be done at present
(lexicography) falls short of actual translation.

In short, a good deal of machine learning from bilingual
texts is possible, and linguists should care. This book,
then, is an elaboration on these issues.

In the acknowledgements, Melamed says that the book is a
revision of his dissertation. Nevertheless, it reads
rather like a collection of stand-alone chapters. Indeed,
several of the chapters are revisions of work published
elsewhere: save for the addition of a few paragraphs
dealing with Chinese-English alignment, chapters two and
three are nearly identical to an article published in
the journal Computational Linguistics
(Melamed 1999), while chapter seven was previously
published (with rather more differences) in the same
journal a year later (Melamed 2000). Some of the remaining
chapters appear to be revisions of conference papers.
Because the chapters are almost stand-alone, I will
depart from the usual practice in Linguist List reviews
of saving the evaluation until the end, instead
interspersing my comments where appropriate.

The chapters are arranged into three sections plus an
introductory chapter (previewing the following chapters)
and a summary chapter (reprising the preceding chapters,
and suggesting directions for future work).

The first section, 'Translation Equivalence among Word
Tokens', sets out the methodology of alignment, and one
application of alignment tools.

Chapter two (following the introductory chapter) describes
an algorithm for finding the alignment between the two
halves of a "bitext", that is, a bilingual text. The
algorithm requires that some correspondences between words
in the two halves of the bitext be known in advance, either
from a seed bilingual lexicon, or from cognates. (The term
'cognates' should be interpreted liberally: for instance,
numbers or dates could serve as cognates in certain texts.)
Given a pair of words in the two languages which are mutual
translations, there may be multiple occurrences of each in
the two halves of the bitext (and the number of occurrences
may not be the same, since there is often more than one way
to translate a given word). In theory, any pair of
occurrences could represent an alignment; in practice, the
true correspondences tend to occur in "corresponding
positions" in the two texts (where "corresponding position"
means something like "the same fraction of the way through
the text"). Melamed's algorithm capitalizes on this to find
the most likely correspondences between the two texts.
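
To make 'corresponding positions' concrete, here is a minimal
sketch (my own illustration, not Melamed's SIMR code) of
scoring candidate pairings of word occurrences by how close
they lie to the bitext's main diagonal:

    def displacement(pos_x, len_x, pos_y, len_y):
        """Distance of a candidate point from the bitext's main
        diagonal, as a difference of relative positions."""
        return abs(pos_x / len_x - pos_y / len_y)

    def best_pairing(occs_x, occs_y, len_x, len_y):
        """Pick the occurrence pair closest to the main diagonal."""
        return min(((i, j) for i in occs_x for j in occs_y),
                   key=lambda p: displacement(p[0], len_x, p[1], len_y))

    # A word at token 120 of a 1000-token text, whose translation
    # occurs at tokens 95 and 470 of a 900-token text:
    print(best_pairing([120], [95, 470], 1000, 900))   # -> (120, 95)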

This sounds straightforward, but of course the difficulty
lies in the details, and Melamed lays out carefully how his
implementation, SIMR ('Smooth Injective Map Recognizer'),
finds the (usually) correct correspondences.
One of his innovations over other alignment algorithms is to
search for just a few correspondences at a time, gradually
extending the mapping from the beginning of the bitext.
In addition to running faster and in less memory (linear
time and space) on long texts, this innovation also allows
localized 'noise filters'. That is, a particular word and
its translation may be rare in texts, and therefore a good
indication overall of alignment. But at a particular
position in a text, the word may be quite frequent, and
therefore a source of confusion. Since Melamed's algorithm
works on a small stretch of text at a time, it can afford
to ignore words which are locally frequent--in effect, a
localized noise filter.
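
The localized noise filter might be pictured as follows (a
sketch of the idea as I understand it, not SIMR's actual
implementation; the threshold is invented):

    from collections import Counter

    def locally_noisy(word, window_tokens, max_local_count=3):
        """True if a word occurs so often within the current small
        search window that it should be ignored there, even if it
        is rare in the text as a whole."""
        return Counter(window_tokens)[word] > max_local_count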

SIMR has a number of parameters, such as how long a stretch
of text it attempts to use at a time. For optimum speed,
these parameters must be set individually for each
language pair (and indeed, for different genres), and
Melamed uses training data (pre-aligned texts). The claim,
tested on French, Spanish, Korean and Chinese (with English
as the second language in each case) is that "it should be
possible to create enough hand-aligned [training] data to
port SIMR to any language pair in under two days" (page
33). The training data in one case was the Bible, with the
initial alignment being at verse boundaries. Significantly,
this sort of training data is available for nearly every
written language (Resnik, Olsen and Diab 1999).

A limitation of this methodology is that it requires
stemming words to canonical form. (This observation is not
limited to SIMR, but applies to many of the other programs
Melamed describes; there are, however, other alignment
algorithms to which it does not apply.) That is,
inflectional affixes must be removed, and any stem
allomorphy undone. While there are programs that attempt
to 'learn' morphology (see e.g. Goldsmith 2001), such work
is still experimental and limited; for now, the implication
for languages with inflectional morphology is that one must
first have a stemmer.
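
A toy example of the stemming prerequisite (crude suffix
stripping for illustration only; languages with rich
inflection or stem allomorphy need a real morphological
analyzer):

    SUFFIXES = ("ing", "ed", "es", "s")   # illustrative English suffixes

    def crude_stem(word):
        """Strip a common inflectional suffix to get a rough
        canonical form, so different surface forms of a word can
        be matched across the two halves of the bitext."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    print(crude_stem("aligned"), crude_stem("aligns"))   # align align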

After correspondences have been found, there is the further
problem of finding alignments between the halves of the
bitext. Correspondences may cross; alignments do not. For
example, if the word order in two languages is different,
correspondences between the words in a given sentence will
likely cross, but the correspondences for two successive
sentences will likely not cross. The algorithm described
in chapter three finds alignments between 'segments' of
two texts for which correspondences have already been
discovered at a finer level of granularity, where a segment
may be a sentence, paragraph, list, etc. Tests on French-
English bitexts show Melamed's algorithm to be more
accurate than other alignment algorithms. At least as
important, its run time on long texts should be much
shorter than that of other published algorithms.
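
The crossing constraint is easy to state (my illustration,
not the chapter-three algorithm itself): a segment-level
alignment must be monotonic, even though the word
correspondences inside aligned segments may cross.

    def is_valid_alignment(segment_pairs):
        """segment_pairs: (index_in_text_x, index_in_text_y) pairs.
        Valid if the y indices increase along with the x indices,
        i.e. no two aligned segment pairs cross."""
        ys = [y for _, y in sorted(segment_pairs)]
        return all(a <= b for a, b in zip(ys, ys[1:]))

    print(is_valid_alignment([(0, 0), (1, 1), (2, 2)]))   # True
    print(is_valid_alignment([(0, 1), (1, 0)]))           # False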

Chapter four shows that automatic alignment tools can be
used to discover (larger) omissions in
translated texts. A test of this software found a number
of previously undetected omissions in a hand-aligned text
- a text which had been used as a standard of comparison
for computer alignment! This seems to be an excellent use
of computers for something which people are not good at.
At the same time, I had to wonder whether better tools for
the human translators would not have prevented the omissions
in the first place. In fact, it seems likely that much of
the high-level alignment which is the topic of chapters two
and three could be done better in a translation tool, given
that human translators almost invariably translate
paragraph-by-paragraph (if not sentence-by-sentence) to
begin with. This would relegate the need for software for
higher-level alignment to "legacy" texts. (Of
course, there are a great many such legacy texts now,
so perhaps this is a needless worry.)
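
The underlying idea, as I understand it (this is not the
actual chapter-four tool), is that an omission shows up as a
long stretch of the original text to which almost nothing in
the translation corresponds:

    def likely_omissions(points, min_gap=100, max_translated=10):
        """points: correspondence points (position in original,
        position in translation), sorted by position in the
        original.  Flag stretches where the original advances a
        lot while the translation barely advances.  Thresholds
        are invented for illustration."""
        flagged = []
        for (x1, y1), (x2, y2) in zip(points, points[1:]):
            if (x2 - x1) >= min_gap and (y2 - y1) <= max_translated:
                flagged.append((x1, x2))
        return flagged

    print(likely_omissions([(0, 0), (50, 48), (400, 55), (450, 110)]))
    # -> [(50, 400)]: 350 tokens of original, almost no translation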

Part two of this book is entitled 'The Type-Token
Interface', although it is not clear that the two chapters
it contains have much to do with each other. In chapter
five, Melamed describes a predicate, called a 'model of
co-occurrence', which, given a region of the bitext and a
pair of word tokens, indicates whether those tokens
co-occur in that region. Such a predicate might be used to
help build a bilingual dictionary, for example. Others
have proposed such predicates before; the work described
here consists of (substantial) refinements.
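
In its barest form, such a predicate might look like this (a
sketch of the general idea only; Melamed's refinements
concern how the co-occurrence regions are chosen and
counted):

    def co_occur(token_x, token_y, region):
        """region: (tokens in one half, tokens in the other half)
        for a single aligned stretch of the bitext.  True if both
        tokens appear there."""
        tokens_x, tokens_y = region
        return token_x in tokens_x and token_y in tokens_y

    region = (["the", "house", "is", "red"],
              ["la", "maison", "est", "rouge"])
    print(co_occur("house", "maison", region))   # True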

Chapter six (and the appendix) describes a program for
manually marking correspondences between words in bitext.
The program was used by a number of annotators on 250
verses of the Bible, in French and English, with good
results: inter-annotator agreement was in the low 90%
range if function words were ignored (and somewhat lower
when function words were counted). The resulting bitext with
correspondences marked is used as a gold standard against
which to test computer programs.
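
One simple way to compute the inter-annotator agreement just
mentioned (my illustration; the measure actually used in
chapter six may differ) is the proportion of word links
shared by two annotators:

    FUNCTION_WORDS = {"the", "a", "of", "le", "la", "de"}  # illustrative

    def link_agreement(links_a, links_b, ignore_function=True):
        """links_a, links_b: sets of (English word, French word)
        links from two annotators; returns the shared proportion."""
        if ignore_function:
            keep = lambda l: not (set(l) & FUNCTION_WORDS)
            links_a = {l for l in links_a if keep(l)}
            links_b = {l for l in links_b if keep(l)}
        union = links_a | links_b
        return len(links_a & links_b) / len(union) if union else 1.0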

Part three turns to 'translation models', a general term
which refers in this context to a probabilistic
equivalence between the elements in the two halves of a
bitext. Such a model can be decomposed into sub-models in
various ways. For example, ignoring syntax, and indeed the
entire problem of relative word order, gives a word-to-word
translation model, the subject of chapter seven.
Conceptually, this is like a bilingual dictionary, except
that it includes (an approximation to) the relative
'importance' (frequency) of various translations of
a word (a property which may vary by topic and by genre).
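
As a rough picture of what such a model contains (a bare
relative-frequency sketch; Melamed's chapter-seven estimation
is considerably more sophisticated):

    from collections import Counter, defaultdict

    def translation_model(aligned_word_pairs):
        """aligned_word_pairs: (source word, target word) links
        taken from a bitext.  Returns, for each source word, the
        relative frequency of each of its translations."""
        counts = defaultdict(Counter)
        for src, tgt in aligned_word_pairs:
            counts[src][tgt] += 1
        return {src: {tgt: n / sum(c.values()) for tgt, n in c.items()}
                for src, c in counts.items()}

    pairs = [("right", "droite"), ("right", "droit"), ("right", "droite")]
    print(translation_model(pairs))
    # {'right': {'droite': 0.67, 'droit': 0.33}} (approximately)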

Lest word-to-word translation models seem irrelevant or
naive, Melamed points out a number of applications,
including cross-language information retrieval, and
development and maintenance of bilingual lexicons for
machine translation. The methodology described in this
chapter seems sufficient for adding entries to such a
bilingual lexicon; the state of the art is not, apparently,
sufficient for a machine to determine the contexts in which
each translation of a word would be appropriate. (The
bilingual lexicon entries resulting from the translation
model described here must also be validated by humans.
Happily, the entries can be sorted by their probability of
being correct, which should make the validation task
easier.)

This chapter is probably the most mathematical of the book.
But the non-mathematician linguist should not feel that his
role is being usurped by statistical methods, for Melamed
is careful to point out that the field is ripe for
exploiting pre-existing (e.g. linguistic) knowledge: "each
infusion of knowledge about the problem domain yielded
better translation models" (page 121).

The next chapter looks at how bitexts can be exploited to
discover what Melamed calls "non-compositional compounds."
(The term refers to any sequence of words--not necessarily
contiguous--which is not transparently translated: idioms,
for example.) Melamed tested his methodology for finding
such "compounds" on a corpus of French-English text. A
random sample of the compounds ranges from genuine cases of
non-compositional translations ('shot-gun wedding' and
'blaze the trail', the latter presumably in its
non-idiomatic sense), to company names ('Generic
Pharmaceutical Industry Association'). Depending on your
view, this chapter shows that lexemes are not in one-to-one
correspondence with space-delimited words (in languages
whose orthography works that way), or it shows one way the
methodology of the previous chapter could be extended
beyond the word-to-word model.
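
A crude way to picture the test (my own simplification, not
Melamed's information-theoretic method; the French glosses
are invented for illustration): a candidate word sequence is
suspect if the translation observed for the whole sequence is
not built from the usual translations of its parts.

    def looks_non_compositional(words, observed_translation, word_model):
        """words: a source-language word sequence;
        observed_translation: target words aligned to that sequence;
        word_model: each word's usual single-word translation."""
        expected = {word_model[w] for w in words}
        return not expected <= observed_translation

    model = {"shot-gun": "fusil", "wedding": "mariage"}
    print(looks_non_compositional(("shot-gun", "wedding"),
                                  {"mariage", "force"}, model))   # True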

The last substantive chapter (preceding the summary
chapter) describes an algorithm for automatically
discovering the word senses in a bitext. The question of
how to divide up the senses of a word is a controversial
one; Melamed finesses it by assuming that the number of
senses of a word in language X corresponds approximately to
the number of words into which that word translates in
language Y.
(This is an approximate limit; for instance, it is possible
that two words in language Y are actually synonyms.) The
problem is then to discover 'informants': evidence from the
context of the word in language X which can be used to
predict which way it will be translated into language Y.
Melamed achieves a statistically significant improvement in
translation accuracy--but it was astonishing to me how
small that improvement was: between one and two percentage
points. Melamed attributes the small improvement to limits
on what the program could use as 'informants': the five
words to the left and right of the word to be
disambiguated. A limit this is, but in the 1950s Abraham
Kaplan showed that for humans, even two words to the left
and right were as good as the entire text for
disambiguating word senses (Ide and Veronis 1998).
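
The notion of an 'informant' can be sketched as follows (a
simplified illustration, not the chapter-nine algorithm; the
French glosses are merely illustrative): collect the words
within five tokens of the ambiguous word and see which of
them predict its translation.

    from collections import Counter, defaultdict

    def informant_counts(instances, window=5):
        """instances: (context tokens, index of ambiguous word,
        observed translation).  Counts how often each nearby word
        co-occurs with each translation."""
        counts = defaultdict(Counter)
        for tokens, i, translation in instances:
            lo, hi = max(0, i - window), i + window + 1
            for informant in tokens[lo:i] + tokens[i + 1:hi]:
                counts[informant][translation] += 1
        return counts

    data = [(["my", "friend", "to", "my", "right"], 4, "droite"),
            (["the", "right", "answer"], 1, "bon")]
    print(informant_counts(data)["my"])   # Counter({'droite': 2})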

A list of sample ambiguities, together with the 'informants'
that Melamed's program found to disambiguate them in the
bitexts, offers a clue to why sense discovery is not more
helpful: the 'informants' are largely ad hoc. The English
word 'right', for example, is translated into French in at
least seven ways in the bitexts, one of which refers to the
direction. The informants for this sense are the words 'my'
and 'friend'. It turns out that this sense appeared
frequently in the phrase 'my friend to my right'! The lack
of generality here is obvious, and it is not clear how a
computer could do much better with that sort of data. As has
often been observed, word sense disambiguation is
'AI-complete': its resolution requires resolving all the
problems of artificial intelligence.

To my knowledge, the computer tools discussed in this book
have not been made available on the Web (although the 250
verse 'gold standard' described in chapter six is freely
available, as are a number of other tools and papers at
http://www.cis.upenn.edu/~melamed/).

Anyone wanting to know more about the uses (and
limitations) of bitexts will want to read this book
(although, as mentioned above, much of it has been published
elsewhere). The jacket blurb (reproduced at the MIT Press
web site) claims it is "a start-to-finish guide to designing
and evaluating many translingual applications." Melamed's
book is not (nor was it probably intended to be) a
start-to-finish guide, but this bit of publisher's hyperbole
should not detract from its usefulness. (For those needing
the 'start' of a 'start-to-finish guide', see Wu 2000. The
finish is not in sight yet, as Melamed makes clear.)

Finally, I wish to comment on the publisher. At a
reasonable price for a hardcover book, MIT Press has done
an excellent job of production. The format is clear,
illustrations and charts are reproduced well, and typos
seem to be few (more precisely, I could not find any).
There are publishers who would do well to imitate MIT Press
in these areas (if not with respect to the jacket blurb).


References

Dale, Robert; Hermann Moisl; and Harold Somers (editors).
   2000. The Handbook of Natural Language Processing.
   Marcel Dekker, Inc.

Goldsmith, John. 2001. "Unsupervised Learning of the
   Morphology of a Natural Language." Computational
   Linguistics 27: 153-198.

Ide, Nancy; and Jean Veronis. 1998. "Introduction to the
   Special Issue on Word Sense Disambiguation: The State of
   the Art." Computational Linguistics 24: 1-40.

Matsumoto, Yuji; and Takehito Utsuro. 2000. "Lexical
   Knowledge Acquisition." Pages 563-610 in Dale, Moisl, and
   Somers 2000.

Melamed, I. Dan. 1999. "Bitext Maps and Alignment via
   Pattern Recognition." Computational Linguistics 25:
   107-130.

Melamed, I. Dan. 2000. "Models of Translational Equivalence
   among Words." Computational Linguistics 26: 221-249.

Resnik, Philip; Mary Broman Olsen; and Mona Diab. 1999.
   "Creating a Parallel Corpus from the Book of 2000
   Tongues." Computers and the Humanities 33: 129-153.

Wu, Dekai. 2000. "Alignment." Pages 415-458 in Dale, Moisl,
   and Somers 2000.


Mike Maxwell works in the development of computational
environments for syntactic, morphological and phonological
analysis for the Summer Institute of Linguistics. He has a
Ph.D. in linguistics from the University of Washington.


---------------------------------------------------------------------------

If you buy this book please tell the publisher or author
that you saw it reviewed on the LINGUIST list.

---------------------------------------------------------------------------
LINGUIST List: Vol-12-1755


