[Corpora-List] Re: [Corpora-list] Incidence of MWEs

Kit Chun Yu ctckit at cityu.edu.hk
Sat Mar 18 01:39:15 UTC 2006

why not think about these kinds of issues from the perspective of
tokenization for NLP?
(a very old paper: Webster & Kit, "Tokenization as the initial phase in 
NLP", COLING-92 1106-1110.)
a very simple idea: anything that is not to be further decomposed into
any smaller fragments is simply treated as a token.
what is a token (or atomic text unit, which may have its own internal 
structure) seems to be application-dependent.
we may have mono-word and multi-word tokens, incl. continuous and 
discontinuous (or noncontiguous) ones (or MWEs).
accordingly, we can have something like this for tagging:
<t ..> <w ..>... </w> <w ..>... </w> ... </t>
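
a minimal sketch of the contiguous case, just to illustrate the tagging above. everything in it is an assumption for illustration rather than anything from the paper: the function names `tokenize` and `to_markup`, the toy lexicon, the greedy longest-match strategy, and the `max_len` cutoff are all hypothetical choices.

```python
from xml.sax.saxutils import escape


def tokenize(words, mwe_lexicon, max_len=4):
    """Group words into tokens by greedy longest match (an assumed strategy).

    Each token is a tuple of one or more words. A multi-word token is
    formed only if the word sequence is listed in mwe_lexicon; otherwise
    the word stands alone as a mono-word token.
    """
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, falling back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if n == 1 or candidate in mwe_lexicon:
                tokens.append(candidate)
                i += n
                break
    return tokens


def to_markup(tokens):
    """Render tokens as <t> elements, each wrapping its words in <w>."""
    return " ".join(
        "<t>" + " ".join(f"<w>{escape(w)}</w>" for w in tok) + "</t>"
        for tok in tokens
    )


# Toy lexicon of contiguous multi-word tokens (illustrative only).
lexicon = {("kicked", "the", "bucket"), ("in", "spite", "of")}
words = "he kicked the bucket in spite of it".split()
tokens = tokenize(words, lexicon)
print(to_markup(tokens))
```

which sequences go into the lexicon is exactly the application-dependent part: a different application would supply a different lexicon and get a different tokenization of the same text.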
we may need some more sophisticated tagging for discontinuous ones, of
course.
just to put in my two cents.

Chunyu Kit, PhD
Assistant Professor in Computational Linguistics

Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon

E-mail:ctckit at cityu.edu.hk
Fax: (+852)2788 8706, 2788 8732
Tel: (+852)2788 9310 (O), 9380 1738 (M)
     (+86)136 5881 2972 (China Mobile)
