[Corpora-List] Re: [Corpora-list] Incidence of MWEs

Kit Chun Yu ctckit at cityu.edu.hk
Sat Mar 18 01:39:15 UTC 2006


why not think about this kind of issue from the perspective of 
tokenization for NLP?
(a very old paper: Webster & Kit, "Tokenization as the initial phase in 
NLP", COLING-92 1106-1110.)
a very simple idea: anything that is not to be further decomposed into 
smaller fragments is simply treated as a token.
what is a token (or atomic text unit, which may have its own internal 
structure) seems to be application-dependent.
we may have mono-word and multi-word tokens; the multi-word ones (or MWEs) 
may be continuous or discontinuous (noncontiguous).
accordingly, we can have something like this for tagging:
<t ..> <w ..>... </w> <w ..>... </w> ... </t>
we may need some more sophisticated tagging for discontinuous ones, of 
course.
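
to make the tagging idea concrete, here is a minimal sketch in Python, not a definitive implementation. it assumes a small hypothetical lexicon of multi-word tokens (MWE_LEXICON) and a greedy longest-match strategy, neither of which comes from the paper cited above, and it handles only contiguous multi-word tokens. the <t> and <w> elements are emitted without attributes, since the attributes are left open above.

```python
from xml.sax.saxutils import escape

# Hypothetical lexicon of contiguous multi-word tokens (illustration only).
MWE_LEXICON = {("in", "spite", "of"), ("kick", "the", "bucket")}
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)


def tokenize(words):
    """Group words into tokens by greedy longest match against MWE_LEXICON.

    Each token is a tuple of words: length 1 for a mono-word token,
    length > 1 for a multi-word token.
    """
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, down to two words.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in MWE_LEXICON:
                tokens.append(tuple(words[i:i + n]))
                i += n
                break
        else:
            # No multi-word match starts here: emit a mono-word token.
            tokens.append((words[i],))
            i += 1
    return tokens


def to_markup(tokens):
    """Render tokens as <t> elements, each wrapping one or more <w> elements."""
    rendered = []
    for token in tokens:
        inner = " ".join(f"<w>{escape(word)}</w>" for word in token)
        rendered.append(f"<t>{inner}</t>")
    return " ".join(rendered)


if __name__ == "__main__":
    print(to_markup(tokenize("He left in spite of the rain".split())))
```

with this lexicon, "in spite of" comes out as one <t> holding three <w> elements, while every other word gets its own <t>. a discontinuous token would need something extra, e.g. a shared identifier linking its separated parts, which this sketch does not attempt.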
just to put in my two cents.
best,

Chunyu Kit, PhD
Assistant Professor in Computational Linguistics

Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon

E-mail:ctckit at cityu.edu.hk
http://personal.cityu.edu.hk/~ctckit/
Fax: (+852)2788 8706, 2788 8732
Tel: (+852)2788 9310 (O), 9380 1738 (M)
     (+86)136 5881 2972 (China Mobile)


