[Corpora-List] Re: [Corpora-list] Incidence of MWEs
Kit Chun Yu
ctckit at cityu.edu.hk
Sat Mar 18 01:39:15 UTC 2006
why not think about this kind of issue from the perspective of
tokenization for NLP?
(a very old paper: Webster & Kit, "Tokenization as the initial phase in
NLP", COLING-92 1106-1110.)
a very simple idea: anything that is not to be further decomposed into
smaller fragments is simply treated as a token.
what is a token (or atomic text unit, which may have its own internal
structure) seems to be application-dependent.
we may have mono-word and multi-word tokens, incl. continuous and
discontinuous (or noncontiguous) ones (or MWEs).
accordingly, we can have something like this for tagging:
<t ..> <w ..>... </w> <w ..>... </w> ... </t>
we may need some more sophisticated tagging for discontinuous ones, of
course.
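to make the contiguous case concrete, here is a minimal sketch in Python. it is only an illustration under stated assumptions, not the method of the paper cited above: the MWE lexicon, the greedy longest-match strategy, and the function names tokenize and to_markup are all hypothetical, and the tags are emitted without attributes because the scheme above leaves them open. discontinuous tokens are not handled.

```python
from xml.sax.saxutils import escape


def tokenize(words, mwe_lexicon):
    """Group words into tokens by greedy longest match against a lexicon.

    mwe_lexicon is an assumed set of word tuples; a real application
    would decide for itself what counts as an atomic unit.
    """
    max_len = max((len(m) for m in mwe_lexicon), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, fall back to one word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = tuple(words[i:i + n])
            if n == 1 or span in mwe_lexicon:
                tokens.append(list(span))
                i += n
                break
    return tokens


def to_markup(tokens):
    """Render each token as <t> wrapping one <w> per component word."""
    lines = []
    for tok in tokens:
        inner = " ".join(f"<w>{escape(w)}</w>" for w in tok)
        lines.append(f"<t>{inner}</t>")
    return "\n".join(lines)


lexicon = {("kick", "the", "bucket"), ("New", "York")}
words = "he may kick the bucket in New York".split()
tokens = tokenize(words, lexicon)
print(to_markup(tokens))
```

a mono-word token comes out as a <t> holding a single <w>, and a multi-word token as a <t> holding several, so the same two-level structure covers both.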
just to put in my two cents.
best,
Chunyu Kit, PhD
Assistant Professor in Computational Linguistics
Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon
E-mail:ctckit at cityu.edu.hk
http://personal.cityu.edu.hk/~ctckit/
Fax: (+852)2788 8706, 2788 8732
Tel: (+852)2788 9310 (O), 9380 1738 (M)
(+86)136 5881 2972 (China Mobile)