9.488, Review: Boguraev & Pustejovsky: Corpus Processing.
LINGUIST List: Vol-9-488. Mon Mar 30 1998. ISSN: 1068-4875.
Subject: 9.488, Review: Boguraev & Pustejovsky: Corpus Processing.
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
Review Editor: Andrew Carnie <carnie at linguistlist.org>
Editors: Brett Churchill <brett at linguistlist.org>
Martin Jacobsen <marty at linguistlist.org>
Elaine Halleck <elaine at linguistlist.org>
Anita Huang <anita at linguistlist.org>
Ljuba Veselinova <ljuba at linguistlist.org>
Julie Wilson <julie at linguistlist.org>
Software development: John H. Remmers <remmers at emunix.emich.edu>
Zhiping Zheng <zzheng at online.emich.edu>
Home Page: http://linguistlist.org/
Editor for this issue: Andrew Carnie <carnie at linguistlist.org>
==========================================================================
What follows is another discussion note contributed to our Book Discussion
Forum. We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.
If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion." (This means that
the publisher has sent us a review copy.) Then contact Andrew Carnie at
carnie at linguistlist.org
=================================Directory=================================
1)
Date: Wed, 18 Mar 1998 09:58:37 -0500
From: Kevin Cohen <kevin at cmhcsys.com>
Subject: Review of Boguraev and Pustejovsky
-------------------------------- Message 1 -------------------------------
Date: Wed, 18 Mar 1998 09:58:37 -0500
From: Kevin Cohen <kevin at cmhcsys.com>
Subject: Review of Boguraev and Pustejovsky
Branimir Boguraev and James Pustejovsky. 1996. Corpus processing for
lexical acquisition. Cambridge, Massachusetts: MIT Press. 245 pages.
$32.50.
The term "acquisition" in the title of this book refers to automatic
learning---acquisition not by human children, but by natural language
systems. The papers in this book deal with the topic of building and
refining lexica for natural language systems automatically--i.e. by
computer, with little or no human intervention--from large corpora.
Building lexica for natural language systems by hand is difficult, expensive,
and labor-intensive, and the result may be out of date before it is completed.
Furthermore, by the standards of earlier systems, lexica have become
enormous. Continuous speech dictation systems ship with active vocabularies
in the range of 30,000 lexical items. Lexica in production by one company
are expected to have 200,000 entries for American English and 700,000
entries for German. So, from an industrial point of view, work on the
automatic acquisition of lexical knowledge is very welcome. This is not
to say that automatic lexical acquisition should be of interest only to
applied linguists. Lexical information is also necessary in psycholinguistic
research, and some of the work in this volume shows such application.
Furthermore, the sorts of data that researchers in this field are attempting
to acquire are just the sorts of data needed for large-scale
applications of formalisms like Head-Driven Phrase Structure Grammar.
So, the work described in this book should be of interest to academic,
as well as industrial, linguists.
This book is the result of a workshop, and as such, it has the usual
scattering of topics seen in proceedings. This should be seen as a
feature, not a bug: the result is that there is something here for
everyone. Various papers come from the fields of corpus linguistics,
statistical analysis of language, psycholinguistics, rule acquisition,
semantics, and lexical acquisition. The papers are divided into five
broad categories: (1) unknown words, (2) building representations,
(3) categorization, (4) lexical semantics, and (5) evaluation. In
addition, a paper by the editors lays out the reasons for, and challenges
of, automatic acquisition of lexical information.
(1) Introduction
Issues in text-based lexicon acquisition, Branimir Boguraev and James
Pustejovsky. This paper presents an in-depth answer to the question
with which lexicon builders are perennially plagued by anyone to whom
they try to explain their work: why not just use an on-line dictionary?
The short answer is that such dictionaries are static and do not evolve at
the same pace as the language that they are attempting to describe. The
long answer is that natural language systems require information that is
not reflected in traditional dictionaries--semantic feature geometries,
subcategorization frames, and so on. So: "the fundamental problem of
lexical acquisition... is how to provide, fully and adequately, the
systems with the lexical knowledge they need to operate with the proper
degree of efficiency. The answer... to which the community is converging
today... is to extract the lexicon from the texts themselves" (3).
Automatic lexical acquisition can trivially solve the short-answer problem
by allowing updating as frequently as new data can be acquired. More
importantly, it allows linguists to define the questions that they
would like the lexicon to answer, rather than having those questions
chosen for them by the dictionary maker.
(2) Dealing with unknown words
Consider a spell checker that encounters the (unknown) word
"Horowitz." The spell checker would like to know the best action to
take with this word: is it a misspelling that should be replaced with
something else, or is it a precious datum that should be added to its
lexicon? The spell checker asks its user; the papers in this section
discuss attempts to answer these questions automatically.
Linguists tend not to pay much attention to proper nouns. As McDonald
puts it in an epigraph to his paper in this volume, "proper names are
the Rodney Dangerfield of linguistics. They don't get no respect" (21).
Thus, it might surprise the reader to find that all three of the papers in
this section deal with names. The identification and classification of
names is, in fact, of considerable interest in natural language systems.
For relatively uninflected languages like English, names may constitute
the majority of unknown words encountered in a corpus. Names raise
special issues for classification: they may
have multiple forms; multiple forms may have the same referent in a
single text, raising problems for reference and coindexation; and,
on a less theoretically interesting but no less morally and legally
compelling level, they may require special treatment in the corpus.
For instance, proper names are routinely removed from medical data,
and may need to be removed from sociolinguistic data, as well.
Internal and external evidence in the identification and semantic
categorization of proper names. David D. McDonald. This paper is
written in the language of artificial intelligence. It describes
the Proper Name Facility of the SPARSER system. It describes the
use of context-sensitive rewrite rules to analyze "external evidence"
for proper names, e.g. their combinatorial properties. A surprising
and impressive aspect of the system described here is that it does not
use stored lists of proper nouns.
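To make the internal/external evidence strategy concrete, here is a minimal
Python sketch; the regular expression, cue words, and category labels are
invented for illustration and are in no way SPARSER's actual rewrite rules.

    import re

    # Toy classification of capitalized sequences by internal evidence
    # (what the candidate itself contains) and external evidence (what
    # surrounds it), with no stored list of proper names.
    CAP_SEQ = re.compile(r'(?:[A-Z][a-z]+\.? )*[A-Z][a-z]+\.?')
    COMPANY_SUFFIXES = {'Inc.', 'Corp.', 'Ltd.'}
    PERSON_TITLES = {'Mr.', 'Ms.', 'Dr.', 'Prof.'}

    def classify(candidate, following_word):
        tokens = candidate.split()
        if tokens[-1] in COMPANY_SUFFIXES:          # internal evidence
            return 'company'
        if tokens[0] in PERSON_TITLES:              # internal evidence
            return 'person'
        if following_word in {'said', 'argued'}:    # external evidence
            return 'person'
        return 'unknown-name'

    text = "Dr. Ruth Horowitz joined Acme Widgets Inc. in 1997."
    for m in CAP_SEQ.finditer(text):
        rest = text[m.end():].split()
        print(m.group(), '->', classify(m.group(), rest[0] if rest else ''))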
Identifying unknown proper names in newswire text. Inderjeet Mani,
T. Richard MacMillan. This paper describes a method of using contextual
clues such as appositives ("<name>, the daughter of a prominent local
physician" or "a Niloticist of great repute, <name>") and felicity
conditions for identifying names. The contextual clues themselves
are then tapped for data about the referents of the names.
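Again purely for illustration, a single appositive clue of the sort the
authors exploit might be approximated as follows; the pattern, the invented
name, and the capitalization heuristic are mine, not the paper's.

    import re

    # Mine an appositive both for a name and for a fact about its referent.
    APPOSITIVE = re.compile(
        r'(?P<name>(?:[A-Z][a-z]+ )+[A-Z][a-z]+), the (?P<descr>[^,.]+)')

    text = ("Maria Okello, the daughter of a prominent local physician, "
            "spoke first.")

    for m in APPOSITIVE.finditer(text):
        print('name:       ', m.group('name'))
        print('description:', m.group('descr'))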
Categorizing and standardizing proper nouns for efficient information
retrieval. Woojin Paik, Elizabeth D. Liddy, Edmund Yu, and Mary McKenna.
This paper deals with discovering and encoding relationships between groups
and their members. Paik et al. state the problem as follows: "proper nouns
are... important sources of information for detecting relevant document in
information retrieval....Group proper nouns (e.g., "Middle East") and group
common nouns (e.g., "third world") will not match on their constituents
unless the group entity is mentioned in the document" (61). The problem,
then, is to allow a search on "health care third world" to find a document
on "health care in Nicaragua." The paper includes a short but useful
discussion of the problems that can arise with respect to prepositions
when noun phrases containing proper nouns are parsed as common noun phrases.
(The authors solved this problem by changing the ordering of two bracketing
routines.)
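The flavor of the group-membership problem can be conveyed with a toy
query-expansion sketch; the table of groups and members below is hand-built
and invented, whereas the paper is concerned with deriving and encoding such
relationships automatically.

    # Expand group terms in a query into their members so that a search on
    # "health care third world" can match a document about Nicaragua.
    GROUP_MEMBERS = {
        'third world': {'nicaragua', 'bolivia', 'bangladesh'},
        'middle east': {'jordan', 'syria', 'lebanon'},
    }

    def expand_query(terms):
        expanded = set(terms)
        for group, members in GROUP_MEMBERS.items():
            if group in terms:
                expanded |= members
        return expanded

    query = {'health', 'care', 'third world'}
    document = {'health', 'care', 'in', 'nicaragua'}
    print(expand_query(query) & document)   # now overlaps on 'nicaragua'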
(3) Building representations
Customizing a lexicon to better suit a computational task. Marti A. Hearst,
Hinrich Schuetze. As mentioned above, lexicon building is expensive; this
paper describes a method for reducing development costs by customizing a
pre-existing lexicon, rather than building a new one. The project described
here uses as its pre-existing lexicon WordNet, an on-line lexicon that
contains information about semantic relationships such as hypernymy,
hyponymy, etc. This was customized by reducing the resolution of the
semantic hierarchies to simple categories, and by combining categories
from "distant parts of the hierarchy.....We are interested in finding
grouping of terms that contribute to a frame or schema-like representation...
This can be achieved by finding associational lexical relations among the
existing taxonymic relations" (79). Crucially, these relations should be
derived from a particular corpus. The paper includes a nice description of
the algorithm used for collapsing semantic categories.
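As a rough, invented analogue of category collapsing (the toy hypernym
hierarchy and the fixed-depth cutoff are mine, and much cruder than the
algorithm the paper actually describes):

    # Reduce a fine-grained hypernym hierarchy to coarse categories by
    # cutting it at a fixed distance below the root.
    HYPERNYM = {
        'terrier': 'dog', 'dog': 'canine', 'canine': 'mammal',
        'mammal': 'animal', 'animal': 'organism',
        'oak': 'tree', 'tree': 'plant', 'plant': 'organism',
    }

    def path_to_root(word):
        path = [word]
        while path[-1] in HYPERNYM:
            path.append(HYPERNYM[path[-1]])
        return path                      # e.g. terrier, dog, ..., organism

    def coarse_category(word, depth=1):
        path = path_to_root(word)
        # the ancestor `depth` steps below the root (or the root itself)
        return path[max(len(path) - 1 - depth, 0)]

    for w in ('terrier', 'oak'):
        print(w, '->', coarse_category(w))   # terrier -> animal, oak -> plant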
Towards building contextual representations of word senses using statistical
models. Claudia Leacock, Geoffrey Towell, and Ellen M. Voorhees. This paper
describes a method for differentiating amongst the multiple senses of a
polysemous word. The authors discuss using "topical context," or content
words occurring in the vicinity, and "local context," which includes not
just content words but function morphemes, word order, and syntactic
structure. They test three methods of acquiring topical context:
Bayesian, context vector, and a neural network. They also give the
results of a psycholinguistic experiment comparing human performance
with machine performance, given the topical contexts created by the
three types of "classifiers." Local context acquisition is based on
acquiring "templates," or specific sequences of words. This paper
gives a particularly nice description of its algorithms, and is so
clearly written as to be suitable for presentation in courses on
statistics or psycholinguistics.
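The Bayesian variant of the topical-context idea can be suggested with a
naive-Bayes toy; the training contexts, the two senses, and the add-one
smoothing below are my own stand-ins for the paper's sense-tagged training
data and classifiers.

    import math
    from collections import Counter, defaultdict

    # Classify an ambiguous word's sense from the content words around it.
    train = [
        ('financial', 'deposit interest loan branch'.split()),
        ('financial', 'account interest cash branch'.split()),
        ('river',     'water shore fishing mud'.split()),
        ('river',     'flood water grassy shore'.split()),
    ]

    sense_counts = Counter(sense for sense, _ in train)
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, context in train:
        word_counts[sense].update(context)
        vocab.update(context)

    def classify(context):
        best, best_score = None, float('-inf')
        for sense in sense_counts:
            score = math.log(sense_counts[sense] / len(train))
            total = sum(word_counts[sense].values())
            for w in context:
                # add-one smoothing over the training vocabulary
                score += math.log((word_counts[sense][w] + 1) /
                                  (total + len(vocab)))
            if score > best_score:
                best, best_score = sense, score
        return best

    print(classify('muddy water near the shore'.split()))   # -> 'river'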
(4) Categorization
A context driven conceptual clustering method for verb classification.
Roberto Basili, Maria-Teresa Pazienza, Paola Velardi. This paper describes
a method of categorizing verbs with respect to thematic roles, drawing on
the COBWEB and ARIOSTO_LEX systems. Its aim is to do categorization without
relying on "defining features," and to categorize with respect to the domain
of discourse. The authors describe their algorithms, and the paper has a
nice literature review, covering both psycholinguistic and computational
perspectives on classification.
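The underlying intuition, namely that verbs pattern together when they occur
with similar argument heads, can be shown with a trivial similarity measure;
the verb-object pairs and the plain cosine measure below are illustrative
only and are not COBWEB's incremental conceptual clustering.

    from collections import Counter
    from math import sqrt

    # Describe each verb by the object heads it occurs with.
    observations = [
        ('eat', 'bread'), ('eat', 'soup'), ('drink', 'soup'),
        ('drink', 'water'), ('build', 'house'), ('build', 'bridge'),
    ]

    vectors = {}
    for verb, obj in observations:
        vectors.setdefault(verb, Counter())[obj] += 1

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = (sqrt(sum(v * v for v in a.values())) *
               sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    print(cosine(vectors['eat'], vectors['drink']))   # share 'soup': 0.5
    print(cosine(vectors['eat'], vectors['build']))   # nothing shared: 0.0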
Distinguished usage. Scott A. Waterman. This paper tackles the
syntax/semantics interface. The author attempts to give a linguistic
grounding to systems that map text to some knowledge base by means of
pattern matching: "by relating lexical pattern-based approaches to a
lexical semantic framework, such as the Generative Lexicon theory
[Pustejovsky, 1991], my aim is to provide a basis through which
pattern-based understanding systems can be understood in more conventional
linguistic terms....My main contention is that such a framework can be
developed by viewing the lexical patterns as structural mappings from text
to denotation in a compositional lexical semantics...obviating the need for
separate syntactic and semantic analysis" (144). This paper features an
excellent presentation of background ideas and explication of the issues
that it discusses.
(5) Lexical semantics
Detecting dependencies between semantic verb subclasses and subcategorization
frames in text corpora. Victor Poznanski, Antonio Sanfilippo. This paper
describes "a suite of programs....which elicit dependencies between semantic
verb classes and their... subcategorization frames using machine readable
thesauri to assist in semantic tagging of texts" (176). The system uses a
commercially available thesaurus-like online lexicon to do semantic tagging.
A "subcategorization frame" is then automatically extracted, and the
subcategorization frames are analyzed and classified.
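A stripped-down version of the frame-tallying step might look like the
following; the clause records and frame labels are invented stand-ins for
the output of the programs the paper describes.

    from collections import Counter, defaultdict

    # Tally subcategorization frames per verb from pre-analysed clauses.
    clauses = [
        ('give',  'NP_NP'),     # give someone something
        ('give',  'NP_PP'),     # give something to someone
        ('give',  'NP_PP'),
        ('think', 'SCOMP'),     # think that ...
        ('think', 'SCOMP'),
        ('sleep', 'INTRANS'),
    ]

    frames = defaultdict(Counter)
    for verb, frame in clauses:
        frames[verb][frame] += 1

    for verb, counts in frames.items():
        frame, n = counts.most_common(1)[0]
        print(verb, 'most frequent frame:', frame,
              '(%d of %d clauses)' % (n, sum(counts.values())))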
Acquiring predicate-argument mapping information from multilingual texts.
Chinatsu Aone, Douglas McKee. The authors hold predicate-argument mapping
to be equivalent to conceptual representation; as such, it is clearly
important to language understanding. This is the only paper in the volume
that deals with bilingual corpora.
(6) Evaluating acquisition
Evaluation techniques for automatic semantic extraction: comparing
syntactic and window based approaches. Gregory Grefenstette. This
paper proposes techniques for comparing "knowledge-poor" approaches
to determining the degree of semantic similarity between two words.
A syntax-based method is compared to a windowing technique. The
syntax-based method is shown to perform better for high-frequency words,
while the windowing method is the better performer for low-frequency words.
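A bare-bones version of the "knowledge-poor" window approach, with a toy
corpus and window size chosen purely for illustration (the syntax-based
alternative would substitute parser-derived relations for the raw windows):

    from collections import Counter, defaultdict
    from math import sqrt

    # Describe each word by the words occurring within two tokens of it,
    # and compare words by the cosine of their co-occurrence vectors.
    corpus = ("the doctor treated the patient the nurse treated the patient "
              "the engineer built the bridge").split()

    WINDOW = 2
    vectors = defaultdict(Counter)
    for i, w in enumerate(corpus):
        for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
            if i != j:
                vectors[w][corpus[j]] += 1

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = (sqrt(sum(v * v for v in a.values())) *
               sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    print(cosine(vectors['doctor'], vectors['nurse']))      # many shared contexts
    print(cosine(vectors['doctor'], vectors['engineer']))   # fewer shared contexts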
Conclusion
This is by no means an introductory text on automatic lexical acquisition.
Nonetheless, this volume contains papers that will appeal to workers in a
variety of linguistic disciplines.
The reviewer
K. Bretonnel Cohen is a linguist at Voice Input Technologies in Dublin,
Ohio, where his responsibilities include the construction of tools for
lexicon building and analysis.
---------------------------------------------------------------------------
LINGUIST List: Vol-9-488