22.2758, Review: Computational Linguistics: Anastasiou (2010)

Tue Jul 5 22:59:12 UTC 2011

LINGUIST List: Vol-22-2758. Tue Jul 05 2011. ISSN: 1068 - 4875.

Subject: 22.2758, Review: Computational Linguistics: Anastasiou (2010)

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison  
Monica Macaulay, U of Wisconsin-Madison  
Rajiv Rao, U of Wisconsin-Madison  
Joseph Salmons, U of Wisconsin-Madison  
Anja Wanner, U of Wisconsin-Madison  
       <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Monica Macaulay <monica at linguistlist.org>
================================================================  

This LINGUIST List issue is a review of a book published by one of our
supporting publishers, commissioned by our book review editorial staff. We
welcome discussion of this book review on the list, and particularly invite
the author(s) or editor(s) of this book to join in. If you are interested in 
reviewing a book for LINGUIST, look for the most recent posting with the subject 
"Reviews: AVAILABLE FOR REVIEW", and follow the instructions at the top of the 
message. You can also contact the book review staff directly.

===========================Directory==============================  

1)
Date: 05-Jul-2011
From: Yuancheng Tu [yuanchengtu at gmail.com]
Subject: Idiom Treatment Experiments in Machine Translation

-------------------------Message 1 ---------------------------------- 
Date: Tue, 05 Jul 2011 18:57:04
From: Yuancheng Tu [yuanchengtu at gmail.com]
Subject: Idiom Treatment Experiments in Machine Translation

Announced at http://linguistlist.org/issues/21/21-4572.html 

AUTHOR: Anastasiou, Dimitra 
TITLE: Idiom Treatment Experiments in Machine Translation 
PUBLISHER: Cambridge Scholars Publishing
YEAR: 2010

Yuancheng Tu, Department of Linguistics, University of Illinois at Urbana-Champaign

SUMMARY

Idiomatic expressions cover various types of linguistic units, including
idioms, noun compounds, Named Entities, complex verb phrases and other habitual
collocations. These units pose a particular challenge in empirical Natural
Language Processing (NLP) because they often have idiosyncratic interpretations
that cannot be derived by directly aggregating the semantics of their
constituents (Sag et al., 2002). ''Idiom Treatment
Experiments in Machine Translation'' systematically reviews some theories of
idiomatic expressions and presents a method for recognizing these idiomatic
expressions in a corpus and translating them automatically within an
Example-based Machine Translation System, METIS-II. 

The author focuses on one particular type of idiomatic expression, idiomatic
Verb Phrases (iVPs) in German, and their translation into English.  The author
shows that the METIS-II system performs the automatic translation with the help
of a bilingual dictionary, a monolingual corpus in the target language, and four
types of manually constructed morphosyntactic rules.  Three corpora from three
different sources are used to evaluate the results.  The first consists of 80
sentences sampled from Europarl (EP), the second of 275 sentences filtered from
the web (MDS), and the third of 131 sentences constructed from a part of the
digital lexicon of the German language in the 20th Century (DWDS).  With a
German-English idiom dictionary of 871 entries, the system achieves over 80%
precision, recall and F1 on all three evaluation corpora.
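
For readers less familiar with the three evaluation measures named above, the
following is a minimal sketch of how they are conventionally computed from
counts of correct and incorrect idiom identifications. The function name and the
counts in the usage line are illustrative assumptions and are not taken from the
book.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from raw counts.

    Returns 0.0 for any measure whose denominator would be zero.
    """
    predicted = true_positives + false_positives   # everything the system flagged
    actual = true_positives + false_negatives      # everything it should have flagged
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not results from the book).
p, r, f = precision_recall_f1(true_positives=8, false_positives=2,
                              false_negatives=2)
```

With small test sets such as those described here, a handful of errors shifts
each of these figures by several percentage points.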

The book consists of eleven chapters, which can be categorized into five
sections. The first chapter introduces the definition of translation, and the
motivation and contribution of the current research.  The next three chapters
review the literature on Machine Translation (MT). Chapter five extensively
reviews the theories of idiomatic expressions.  In chapters six through ten, the
author explains her experiments on MT for idiomatic expressions. Chapter eleven
presents the conclusion and discusses further research.

Chapter two of this book describes the history of MT from the perspective of
projects, companies and patents related to MT technology.  In chapter three, the
author introduces a brief history of Example-based Machine Translation (EBMT)
and compares it to two other popular MT frameworks, Rule-based Machine
Translation (RBMT) and Statistical Machine Translation (SMT).  The author
presents EBMT as a system between RBMT and SMT. As in RBMT, its translation
rules are manually extracted. However, unlike in RBMT, such translation
knowledge usually serves as templates and can be used repeatedly in the system.
EBMT is similar to SMT in the sense that it uses bilingual or monolingual
corpora to extract knowledge about sentence formation. However, it does not use
statistical models to decode the alignment or generate the translation. 

In chapter five, the author reviews the broad literature on theories of idioms.
In line with previous work, she concludes that idioms are mainly multi-word
expressions (MWEs) and that no single universal definition works for all of
them. Idioms can be compositional or non-compositional, continuous or
discontinuous. In addition, the set of idioms is open-ended, since new idioms
appear in languages daily. These properties pose a substantial challenge for
recognizing and translating idioms automatically. 

Chapters six to ten explain the idiom treatment experiments conducted.  The
source idioms are iVPs in German and the target language is English.  These
idioms are either continuous or discontinuous within a sentence.  In chapter
seven, the author introduces experiments with three commercial MT systems and
concludes that these systems cannot identify discontinuous idioms. In chapter
eight, she describes an RBMT system, CAT2, and conducts a small-scale experiment
with 58 sentences.  Since her evaluation achieves 100% precision and recall, she
concludes that CAT2 can handle iVP translation successfully.  Finally in
chapters nine and ten, the author discusses how the EBMT system, METIS-II,
treats iVP idioms with a German-English bilingual dictionary, four manually
constructed morphosyntactic rules and a monolingual corpus in English.  The
system assumes that the idioms are listed in the bilingual dictionary. For a
continuous idiom, only one rule is necessary to identify it within the sentence
and then look it up in the dictionary for translation.  The other three rules are
used to handle the cases where the iVPs are discontinuous within the sentence.
Sentences containing discontinuous idioms are constructed manually according to
the German topological field model in order to be identified by the
morphosyntactic rules.  The author conducted three small-scale evaluations on
three different data sets to evaluate the system, and the experiments show more
than 80% precision and recall for all experiments and for both continuous and
discontinuous iVPs. 
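
The distinction between continuous and discontinuous iVPs can be illustrated
with a toy sketch. This is not the METIS-II implementation and does not
reproduce its four morphosyntactic rules; it only assumes, as the system does,
that idioms are listed in a bilingual dictionary, and it further assumes that
the input sentence has already been lemmatized. The dictionary entries, the
function name `find_idiom` and the example sentences are hypothetical.

```python
# Toy idiom dictionary: tuple of German lemmas -> English rendering.
# Entries are illustrative and are not taken from the book's 871-entry lexicon.
IDIOM_DICT = {
    ("Korb", "geben"): "turn down",
    ("Gras", "beißen"): "bite the dust",
}


def find_idiom(lemmas):
    """Return (idiom, translation, kind) for the first idiom found, else None.

    kind is "continuous" if the idiom lemmas are adjacent and in dictionary
    order, and "discontinuous" if all of them occur in the sentence but are
    separated or reordered.
    """
    for idiom, translation in IDIOM_DICT.items():
        n = len(idiom)
        # Continuous case: a single adjacency check is enough.
        for i in range(len(lemmas) - n + 1):
            if tuple(lemmas[i:i + n]) == idiom:
                return idiom, translation, "continuous"
        # Discontinuous case: every part occurs somewhere in the sentence.
        if all(part in lemmas for part in idiom):
            return idiom, translation, "discontinuous"
    return None


# "..., weil sie ihm einen Korb gab" (verb-final subordinate clause)
subordinate = ["weil", "sie", "er", "ein", "Korb", "geben"]
# "Sie gab ihm gestern einen Korb" (verb-second main clause)
main_clause = ["sie", "geben", "er", "gestern", "ein", "Korb"]

print(find_idiom(subordinate))
print(find_idiom(main_clause))
```

A matcher this simple cannot tell literal from idiomatic uses and covers only
idioms already in the dictionary, which are the two limitations taken up in the
evaluation below.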

EVALUATION 

This book is structured clearly, from theoretical review to system description
and finally to system comparison and evaluation. It offers the reader a
relatively comprehensive view of theories of idioms, provides a brief history of
EBMT and introduces different stages to identify and translate idioms in one of
these EBMT systems.  The author lists ample iVP examples in German and shows
systematically how the EBMT system can translate them automatically.  However,
the method offered in this book only focuses on one specific idiom type, iVPs, 
and the sizes of the evaluation corpora used in this study are all very small.
The whole thesis would be significantly strengthened if the author showed
how the method used in the system to translate iVPs could be adapted to
translate other idiomatic phrases, and evaluated it on larger corpora. 

The book identifies several key challenges in MT for idiom translation. However,
the method described in this book does not seem to provide a general approach to
tackle these challenges. The first key challenge is the Out of Vocabulary (OOV)
problem related to idioms.  As mentioned in chapter five of this book, new
idioms are constantly appearing in languages through various communication
channels, and updating these OOV idioms within any MT system is a non-trivial
task.  However, the method provided in this book assumes that all idioms
already exist in the bilingual dictionary. Handling OOV idioms therefore
requires constant, labor-intensive manual maintenance of the system's
electronic dictionaries.  In addition, the morphosyntactic rules within the
system are also manually constructed, and different types of idioms need
different rules. This constraint also limits the scalability and adaptability
of the proposed method.  The second
challenge mentioned in this book is to distinguish the literal and idiomatic
usage of idioms, and the author suggests manually constructing simple heuristics
and matching rules to handle this phenomenon.  As with the approach she offers
for the OOV problem, manually constructing rules for each idiom usage is hard
and very labor-intensive. The author neglects solutions to these challenges in
the SMT literature, which offer more robust alternatives in this field. 

One final note: there are some incongruities between certain chapters of this
book. For example, chapter four about Translation Memory, which is only remotely
related to the main thesis, could be incorporated in the previous chapter on the
history of EBMT. Chapter six, which is related to a historical view on idiom
treatment within MT systems, could also be included in the chapter on the
history of EBMT. In addition, chapter six lists several schemes for
translation equivalence between source and target languages. However, the
later chapters do not clearly state which scheme is used in the current
study. 

''Idiom Treatment Experiments in Machine Translation'' offers a specific approach
to handling a specific type of idiom within the framework of EBMT. It provides
valuable resources such as heuristics and rule templates for EBMT. However, the
proposed method, which consists of manually constructing rules and heuristics
for only one type of idiom in German, is not flexible enough to be adapted to
other types of idioms, and is labor-intensive to maintain as well.  If the book
had surveyed some techniques used in SMT for tackling the challenges posed by
idioms, it would have had a bigger impact and provided readers with a more
comprehensive view of automatic idiom translation. 

REFERENCES

I. Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger. 2002. Multiword
expressions: A pain in the neck for NLP. In Proceedings of the 3rd International
Conference on Intelligent Text Processing and Computational Linguistics,
CICLing-2002, pages 1-15.

ABOUT THE REVIEWER 

Yuancheng Tu is a PhD student in the Department of Linguistics at the
University of Illinois at Urbana-Champaign. Her primary research interests
are Natural Language Processing (NLP), machine learning and computational
lexical semantics. She is also interested in structure learning in NLP and
Text Mining. She is now working on her PhD dissertation on the recognition
and learning of complex verb predicates, such as factive/imperative verbs,
light verb constructions and other inference rules with instantiated or
typed predicates.  Her dissertation proposes a general approach to handle
these complex verb predicates within the framework of lexical and
relational similarities and to use them in real NLP applications such as
the task of Textual Entailment. 
