15.1347, Diss: Corpus Ling: Nishimoto: 'A Corpus-Based ...'

Thu Apr 29 15:44:04 UTC 2004

LINGUIST List:  Vol-15-1347. Thu Apr 29 2004. ISSN: 1068-4875.

Subject: 15.1347, Diss: Corpus Ling: Nishimoto: 'A Corpus-Based ...'

Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Sheila Collberg, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Tomoko Okuno <tomoko at linguistlist.org>
==========================================================================
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
=================================Directory=================================

1)
Date:  Wed, 28 Apr 2004 13:05:54 -0400 (EDT)
From:  enishimoto at gc.cuny.edu
Subject:  A Corpus-Based Delimitation of New Words: Cross-Segment ...

-------------------------------- Message 1 -------------------------------

Institution: City University of New York
Program: Linguistics Program
Dissertation Status: Completed
Degree Date: 2004

Author: Eiji Nishimoto

Dissertation Title:
A Corpus-Based Delimitation of New Words: Cross-Segment Comparison and
Morphological Productivity

Linguistic Field:
Computational Linguistics,
Morphology,
Text/Corpus Linguistics

Dissertation Director 1: Dianne Bradley
Dissertation Director 2: Martin Chodorow
Dissertation Director 3: Virginia Teller

Dissertation Abstract:

The dissertation explores methods of identifying new words in a large
corpus of texts, the British National Corpus (BNC) of 100 million
English words, and of assessing productivity in derivational
affixation. Adopting a smoothing technique, deleted estimation, from
the Language Technology literature, we show that new words can be
detected when segments of a corpus are cross-compared to find which
word types are shared (or unshared). When each corpus segment is
created so as to reflect a set of words used by a group of randomly
sampled speakers, through a randomization respecting document
boundaries, the cross-comparison of corpus segments can be interpreted
as revealing the usage distribution of words across groups of
speakers. A word shared by fewer corpus segments is used by fewer
groups of speakers and is thus a more likely candidate for a new
word. Morphological productivity, the potential of a word formation
process involving an affix to form a new word, is assessed for 12
English derivational suffixes (nominal -ness, -ity, -er, -ee, -ion,
-ment, and -th; verbal -ize and -ify; adjectival -ish and -ous;
adverbial -ly), based on new words identified in the BNC via deleted
estimation. Quantifying the usage distribution of new word types
across corpus segments opens many possibilities for assessing the
productivity of affixes. Cross-comparing as few as two corpus segments
offers a crude yet computationally simple method of separating new
words (unshared) from non-new words (shared), to yield a productivity
index for a given affix. Cross-comparing as many as six corpus
segments supports a graded definition of a word's newness (words
shared by fewer corpus segments being more likely to be new) and
thereby a more detailed characterization of the productivity of
affixes. The
proposed methods of identifying new words and assessing productivity
are shown to offer valuable insights into the issue of productivity in
word formation.
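
The cross-segment comparison described in the abstract can be
illustrated with a minimal sketch. This is not the dissertation's
implementation: it leaves out deleted estimation itself and shows only
the segment-sharing count. The function names, the suffix test by
plain string ending, and the ratio returned by toy_productivity are
assumptions made for illustration, and the word lists are toy data.

```python
import random
from collections import defaultdict


def make_segments(documents, n_segments, seed=0):
    """Shuffle whole documents and deal them into n_segments segments.

    Each document is a list of word tokens. Documents are never split,
    so the randomization respects document boundaries. Each segment is
    returned as the set of word types it contains.
    """
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    segments = [set() for _ in range(n_segments)]
    for i, doc in enumerate(shuffled):
        segments[i % n_segments].update(doc)
    return segments


def segment_counts(segments):
    """Map each word type to the number of segments that contain it."""
    counts = defaultdict(int)
    for segment in segments:
        for word in segment:
            counts[word] += 1
    return dict(counts)


def new_word_candidates(counts, max_shared=1):
    """Return word types found in at most max_shared segments."""
    return {word for word, n in counts.items() if n <= max_shared}


def toy_productivity(counts, suffix, max_shared=1):
    """Hypothetical index: share of suffix types that are candidates.

    Matching on the string ending is a crude stand-in for real
    morphological analysis (it would also match "business" for -ness).
    """
    types = [word for word in counts if word.endswith(suffix)]
    if not types:
        return 0.0
    candidates = [word for word in types if counts[word] <= max_shared]
    return len(candidates) / len(types)
```

With two segments, max_shared=1 corresponds to the shared/unshared
split; with six segments, the counts themselves (1 through 6) give the
graded notion of newness.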
