17.1285, Review: Computational Ling/Semantics: Bond (2005)

Thu Apr 27 00:48:31 UTC 2006

LINGUIST List: Vol-17-1285. Wed Apr 26 2006. ISSN: 1068 - 4875.

Subject: 17.1285, Review: Computational Ling/Semantics: Bond (2005)

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Dooley, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Lindsay Butler <lindsay at linguistlist.org>
================================================================  

This LINGUIST List issue is a review of a book published by one of our
supporting publishers, commissioned by our book review editorial staff. We
welcome discussion of this book review on the list, and particularly invite
the author(s) or editor(s) of this book to join in. To start a discussion of
this book, you can use the Discussion form on the LINGUIST List website. For
the subject of the discussion, specify "Book Review" and the issue number of
this review. If you are interested in reviewing a book for LINGUIST, look for
the most recent posting with the subject "Reviews: AVAILABLE FOR REVIEW", and
follow the instructions at the top of the message. You can also contact the
book review staff directly.

===========================Directory==============================  

1)
Date: 17-Apr-2006
From: Shigeko Nariyama < shigeko at unimelb.edu.au >
Subject: Translating the Untranslatable 

-------------------------Message 1 ---------------------------------- 
Date: Wed, 26 Apr 2006 20:44:29
From: Shigeko Nariyama < shigeko at unimelb.edu.au >
Subject: Translating the Untranslatable 

Announced at http://linguistlist.org/issues/17/17-871.html 

AUTHOR: Bond, Francis
TITLE: Translating the untranslatable 
SUBTITLE: A solution to the problem of generating English determiners
SERIES: CSLI Series in Computational Linguistics
PUBLISHER: CSLI
YEAR: 2005

Shigeko Nariyama, Asia Institute, University of Melbourne

SYNOPSIS

Here is the list of chapters with a brief description of each one.
Ch. 1. Introduction 
Ch. 2. Background: literature review on reference, countability, 
definiteness and thematic marking 
Ch. 3. Determiners and Number in Machine Translation: literature 
review on machine translation and related areas of natural language 
processing (NLP)
Ch. 4. Semantic Representation: a tractable representation of 
referentiality, boundedness (entities with or without salient boundary) 
and definiteness proposed
Ch. 5. Automatic Interpretation: the algorithms proposed that 
determine values for referentiality, boundedness and definiteness
Ch. 6. Evaluation and discussion: implementation of the algorithms 
and comparison with other systems 
Ch. 7. Construction of the Lexicon: compilation of the detailed 
knowledge used in the lexicon
Ch. 8. Automatic Acquisition of Lexical Information: acquired from 
existing dictionaries and corpora
Ch. 9. Conclusion

Chapters 2, 4 and 7 are more linguistic in focus, while the remaining 
chapters are oriented more towards computational NLP. Chapters 5, 6, 
and 8 are the heart of the book. It would have been better had 
Chapter 7 been placed immediately after Chapter 4.

DISCUSSION

'Translating the untranslatable' is exactly what this book is about: 
the near-impossible task of generating linguistic information that is 
required in the target language from source-language input sentences 
that apparently contain no such information.

'Determiners' cover a wide range of linguistic phenomena, including 
definite and indefinite articles (i.e. 'the' and 'a') or null articles, 
possessive pronouns, and number (and therefore they relate to generic 
referents, countability and numeral classifiers). Although all of these 
are syntactically obligatory in English and must be appropriately 
reflected in every sentence, none of them except numeral classifiers 
is grammaticalised in Japanese. Hence, it is easy to see how difficult 
it is to generate English sentences with correct determiners and 
number from Japanese sentences that contain no overt information 
about them. For example, inu 'dog' can be 'a dog', 'the dog', or 
'dogs'. 

Being a native speaker of Japanese myself and speaking English as a 
second language, I learned a great deal from reading this book. As 
mentioned in the book, articles and number are the most frequent 
sources of error for Japanese speakers, accounting for 9% to 18% of 
errors depending on one's competence. The problem of incorrect 
determiners is more serious than it may appear, since an incorrect 
determiner can not only give a sentence the wrong nuance but also 
make it refer to a different entity.

This book is truly comprehensive and has something for everyone. 
Apart from the benefits for second language acquisition mentioned 
above, it examines the issue of determiners and number from 
theoretical linguistics, computational linguistics, and various 
applications in NLP and generation, including machine translation 
systems, on which this book is focused. It is easy to read, particularly 
because of its range of appropriate examples. The use of Japanese 
script in the examples makes reading much easier for Japanese 
speakers, as Japanese has an abundance of homonyms.

Moving on to the details of the book: given the frequent absence of 
determiners and number in Japanese, the information needed to 
generate them has to be sought elsewhere in the sentence. 
Lexical knowledge is one good source, convincingly discussed in 
Chapter 7. Determiners and number also have an intricate relation to 
discourse elements, such as the notions of topic and familiarity. 

All of these linguistic phenomena and discourse elements that play an 
important role for determiners and number are complex issues in their 
own right, and none of them has been satisfactorily accounted for, let 
alone given a comprehensive treatment together with determiners and 
number. For example, various definitions of 'definiteness' in English 
have been proposed: e.g., uniqueness, discourse given, familiarity. 
However, corpus analysis shows that 21% of definite articles are used 
even for unfamiliar and discourse-new entities (Poesio 2004), and thus 
no consensus on the definition has been reached among English 
speakers (see the series of work by Poesio). Even among those 
languages that use definite and indefinite articles, definiteness is 
often language specific. Hence, determiners and number have been 
known to be a perennial problem, particularly in NLP. Because 
computers cannot rely on the intuition that humans can draw on, they 
require explicit procedures for generating determiners and number. 

As a solution to the issues, Bond proposed three algorithms 
concerning 1) referentiality, 2) number and countability, and 3) 
definiteness. These algorithms combine a deep semantic analysis with 
the use of sensible defaults. They were tested in the wide-coverage 
Japanese-to-English machine translation system ALT-J/E. The result 
reported is highly promising: determiners (articles and possessives, to 
be more precise) are generated with an accuracy of over 85%. The 
methodology and evaluation seem sound, as the algorithms were tested 
on 398 sentences with 3,000 NPs.

While this high accuracy may not always be maintained in other text 
domains, it is still highly promising given all the complexities 
associated with the issues. The author ought to be congratulated on 
his achievement. 

I found Chapter 8 particularly meaningful. Bond successfully shows, 
with high precision and F-score, that the countability of unknown 
words, including multi-word (compound) expressions, can be learnt 
automatically with a precision rivalling manual annotation. It is 
acquired from semantic classes and corpora. The main obstacle there 
lies in distinguishing different senses of a word. For example, both 
countable and uncountable usages of 'interest' are found in corpora: 
countable for the sense 'a sense of concern with and curiosity', and 
uncountable for 'fixed charge for borrowing money'.

My main criticism concerns the way the book captures the relationship 
between definiteness and the Japanese thematic marker wa and the 
nominative case marker ga. Bond treats wa (also mo) as definite in 
the algorithm. The relevant discussion is found in Watanabe 
(1989:140-1), who reports that 99.5% of wa-marked arguments are 
definite, whereas only 61.6% of ga-marked arguments are definite, 
and, as a reference point, 100% of elided arguments are definite (ibid. 
75-154). Looking at the issue of definiteness from another 
perspective, 69.9% of definite subjects are marked by wa and 30.1% 
by ga, while 1.7% of indefinite subjects are marked by wa and 98.3% 
by ga.
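
The two perspectives above are related by Bayes' rule, and the quoted 
figures can be checked against each other. The following is a minimal 
sketch, not part of Watanabe's or Bond's analysis. It assumes that all 
of the percentages describe the same population (the review gives the 
first set for 'arguments' and the second for 'subjects') and that wa 
and ga are the only two markers under consideration. The function 
names are hypothetical.

```python
# Figures quoted from Watanabe (1989) in the review above.
P_DEF_GIVEN_WA = 0.995  # share of wa-marked arguments that are definite
P_DEF_GIVEN_GA = 0.616  # share of ga-marked arguments that are definite
P_WA_GIVEN_DEF = 0.699  # share of definite subjects that are marked by wa


def implied_wa_share(p_def_wa: float, p_def_ga: float, p_wa_def: float) -> float:
    """Solve Bayes' rule for w = P(wa), assuming only wa and ga occur.

    p_wa_def = p_def_wa * w / (p_def_wa * w + p_def_ga * (1 - w))
    """
    return p_wa_def * p_def_ga / (p_def_wa * (1 - p_wa_def) + p_wa_def * p_def_ga)


def implied_wa_given_indef(p_def_wa: float, p_def_ga: float, w: float) -> float:
    """Share of indefinite items marked by wa, given the overall wa share w."""
    indef_wa = (1 - p_def_wa) * w        # indefinite and wa-marked
    indef_ga = (1 - p_def_ga) * (1 - w)  # indefinite and ga-marked
    return indef_wa / (indef_wa + indef_ga)


w = implied_wa_share(P_DEF_GIVEN_WA, P_DEF_GIVEN_GA, P_WA_GIVEN_DEF)
p_wa_indef = implied_wa_given_indef(P_DEF_GIVEN_WA, P_DEF_GIVEN_GA, w)
print(f"implied share of wa-marking: {w:.3f}")
print(f"implied P(wa | indefinite):  {p_wa_indef:.3f}")
```

Under these assumptions the implied share of wa-marking is roughly 
0.59, and the implied proportion of indefinite items marked by wa is 
roughly 0.018, which is close to the 1.7% quoted above.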

In principle, ga-marked arguments are indefinite unless they denote an 
exhaustive listing, which connotes emphasis (see Kuno 1973). I 
suspect that another reason why 30.1% of definite subjects are 
marked by ga has to do with their appearing in subordinate clauses; 
the subject of a subordinate clause must be marked by ga, 
irrespective of definiteness. I do not have access to Watanabe's 
corpus to check this point. Even though Watanabe did not specify her 
definition of definiteness in the analysis, and definiteness there may 
not necessarily correspond to the use of 'the', these findings are 
still sufficient to show that the difference between wa and ga 
correlates strongly with definiteness. 
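
To make the point concrete, the observations in this paragraph can be 
written as a default rule. This is only an illustrative sketch of the 
reviewer's argument, not Bond's algorithm and not code from ALT-J/E. 
The function name and its parameters are hypothetical, and the 
exhaustive-listing reading of ga is deliberately left unhandled.

```python
from typing import Optional


def default_definiteness(particle: str,
                         in_subordinate_clause: bool = False) -> Optional[str]:
    """Return a default definiteness value for a marked argument.

    Hypothetical heuristic based on the discussion in the review:
    returns 'definite', 'indefinite', or None when the particle
    gives no usable signal.
    """
    if particle in ("wa", "mo"):
        # wa (and mo) marked arguments are treated as definite.
        return "definite"
    if particle == "ga":
        if in_subordinate_clause:
            # ga is obligatory for subordinate-clause subjects,
            # so it says nothing about definiteness here.
            return None
        # Main-clause ga defaults to indefinite
        # (exhaustive listing is not modelled).
        return "indefinite"
    # Other particles: no default in this sketch.
    return None


print(default_definiteness("wa"))
print(default_definiteness("ga"))
print(default_definiteness("ga", in_subordinate_clause=True))
```

A real system would treat these values only as defaults to be 
overridden by lexical and contextual evidence.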

Furthermore, the classification of the uses of wa and ga described in 
Figure 4, quoted from Hinds (1987), may not be the best representation 
for capturing the difference between the two, because three out of 
seven categories allow both wa and ga. Watanabe (1989: 162) offers a 
more precise representation of the mental processing of wa and ga in 
relation to ellipsis (zero anaphora).

Finally, I totally agree with Bond that discourse context will help 
generate more accurate determiners for NPs that have anaphoric 
relations. This is the overall future direction of work in linguistics 
and NLP, including work on determiners: to deal with discourse 
(sequences of sentences), not just isolated sentences. 

REFERENCES

Hinds, John. 1987. Thematization, assumed familiarity, staging, and 
syntactic binding in Japanese. In J. Hinds et al. (eds.), Perspectives on 
Topicalization: The case of Japanese wa. 83-106.

Kuno, Susumu. 1973. The structure of the Japanese language. 
Cambridge, Mass.: MIT Press.

Poesio, Massimo. 2004. An empirical investigation of definiteness. 
Proceedings of International Conference on Linguistic Evidence, 
Tuebingen.

Vieira, Renata and Massimo Poesio. 2000. Processing definite 
descriptions in corpora. In S. Botley and T. McEnery (eds.), 
Corpus-based and computational approaches to anaphora. UCL Press.

Watanabe, Yashuko. 1989. The function of "WA" and "GA" in 
Japanese discourse. Eugene: University of Oregon Ph.D. dissertation. 

ABOUT THE REVIEWER

Shigeko Nariyama is a lecturer at the Asia Institute, the University of 
Melbourne, Australia. Her main research area is zero anaphora, along 
with lexical semantics, pragmatics and world knowledge that 
contribute to resolving zero anaphora.
