17.2259, Diss: Computational Ling: Lu: 'Hybrid Models for Chinese Unknown Wo...'

Mon Aug 7 16:03:35 UTC 2006

LINGUIST List: Vol-17-2259. Mon Aug 07 2006. ISSN: 1068 - 4875.

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Laura Welcher, Rosetta Project / Long Now Foundation  
         <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Hannah Morales <hannah at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 05-Aug-2006
From: Xiaofei Lu < xflu at ling.osu.edu >
Subject: Hybrid Models for Chinese Unknown Word Resolution 

-------------------------Message 1 ---------------------------------- 
Date: Mon, 07 Aug 2006 12:01:05
From: Xiaofei Lu < xflu at ling.osu.edu >
Subject: Hybrid Models for Chinese Unknown Word Resolution 

Institution: Ohio State University 
Program: Department of East Asian Languages and Literature 
Dissertation Status: Completed 
Degree Date: 2006 

Author: Xiaofei Lu

Dissertation Title: Hybrid Models for Chinese Unknown Word Resolution 

Dissertation URL:  http://ling.osu.edu/~xflu/papers/2006diss.pdf

Linguistic Field(s): Computational Linguistics

Subject Language(s): Chinese, Mandarin (cmn)

Dissertation Director(s):
Walt Detmar Meurers

Dissertation Abstract:

Word segmentation, part-of-speech (POS) tagging, and sense tagging are
important steps in various Chinese natural language processing (CNLP)
systems. Unknown words, i.e., words that are not in the dictionary or
training data used in a CNLP system, constitute a major challenge for each
of these steps. This dissertation is concerned with developing hybrid
models that effectively combine statistical, knowledge-based, and machine
learning approaches for Chinese unknown word resolution, which comprises the
identification, POS tagging, and sense tagging of Chinese unknown words.
Chinese unknown word resolution is hard because little information is
available for predicting the properties of unknown words, so it is crucial
to make optimal use of the information that is available. To this end, this
research explores two central ideas and pursues two corresponding goals.

First, the morphological, syntactic, and semantic information of the
component characters or morphemes of an unknown word provides useful
insights into its structural and semantic properties. The first goal of
this work is to develop novel algorithms that capture such insights. To 
integrate unknown word identification with word segmentation, the notion of
character-based tagging is adopted to model the tendency of individual
characters to combine with adjacent characters to form words in different
contexts. To predict the POS categories of unknown words, morphological
rules that encode knowledge about the relationship between the POS
categories of unknown words and those of their component morphemes are
developed. Finally, to classify unknown words into appropriate semantic
categories in a Chinese thesaurus, rules that capture the regularities in
the relationship between the semantic categories of unknown words and those
of their component morphemes are developed; information-theoretical models
are used to compute the associations between individual morphemes and
semantic categories for the same purpose.
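
As a rough illustration of the character-based tagging idea mentioned
above, the following Python sketch recasts a segmented sentence as one
position tag per character and then rebuilds the words from those tags. The
B/M/E/S tag set (begin, middle, end, single) and both function names are
assumptions made for this example; the abstract does not say which tag set
or implementation the dissertation uses.

```python
# Hypothetical illustration only: the B/M/E/S tag set is a common convention
# assumed here, not something the abstract specifies.

def words_to_char_tags(words):
    """Convert a list of words into (character, position tag) pairs."""
    pairs = []
    for word in words:
        if not word:
            continue  # skip empty strings rather than fail on indexing
        if len(word) == 1:
            pairs.append((word, "S"))          # single-character word
        else:
            pairs.append((word[0], "B"))       # first character of a word
            for ch in word[1:-1]:
                pairs.append((ch, "M"))        # word-internal characters
            pairs.append((word[-1], "E"))      # last character of a word
    return pairs


def char_tags_to_words(pairs):
    """Rebuild words from (character, position tag) pairs."""
    words = []
    current = ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):                  # these tags close a word
            words.append(current)
            current = ""
    if current:                                # flush an unterminated word
        words.append(current)
    return words


if __name__ == "__main__":
    sentence = ["我", "喜欢", "语言学"]
    tagged = words_to_char_tags(sentence)
    print(tagged)
    print(char_tags_to_words(tagged))
```

Under this framing, a tagger that predicts the position tag of each
character segments known and unknown words in the same pass, since no
dictionary lookup is needed to recover word boundaries from the tags.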

Second, beyond information about the component characters of an unknown
word, information about its type, length, and internal structure, as well as
the context in which it occurs, also provides useful insights into its
properties. Existing approaches to Chinese unknown word resolution tend to
each rely on a single source of information, though not the same one, and
are often effective in handling different subsets of unknown words. The
second goal of this research is to identify the relative strengths of the
novel and existing models and to combine them to achieve optimal use of
information and better performance on the task.
