[Corpora-List] A Problem About Chinese Language Modeling

Tue Feb 10 10:58:51 UTC 2009

Hello everyone,
  I want to build a Chinese language model from a corpus of about 1.1G. My question is whether it is better to model at the character level or at the word level (or at an even higher level, such as phrases). Since the vocabulary of Chinese words is much larger than the vocabulary of characters, a character-based model can probably use a higher order than a word-based model. I ran an experiment on a smaller corpus, and the perplexity of the word-based model was much higher than that of the character-based model, at least partly because the word-based model encounters more OOVs. But if finer granularity is preferred, why don't we model English at the character level rather than at the word level?
I would be grateful if anyone could give me some suggestions on this problem.
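
One caveat about the comparison above: perplexity is normalized by the number of units scored, so a per-word figure and a per-character figure are not directly comparable. A minimal sketch of putting both on a per-character basis follows. All numbers and function names are hypothetical and chosen only for illustration, not taken from my experiment, and the sketch assumes both models assign a probability to the same test text.

```python
import math


def perplexity(total_log2_prob, n_units):
    """Perplexity = 2 ** (-(total log2 probability) / (number of units))."""
    return 2.0 ** (-total_log2_prob / n_units)


def word_ppl_to_char_ppl(word_ppl, n_words, n_chars):
    """Re-normalize a per-word perplexity to a per-character perplexity.

    Both figures describe the same total probability of the same text,
    so only the normalizing count changes.
    """
    return word_ppl ** (n_words / n_chars)


# Hypothetical test set: 1000 words made of 1600 characters in total.
n_words = 1000
n_chars = 1600

# Hypothetical total log2 probability a word-based model assigns to the text.
total_log2_prob_word_model = -9000.0

# Per-word perplexity: the number a word-level toolkit would usually report.
word_ppl = perplexity(total_log2_prob_word_model, n_words)

# The same model and the same text, normalized per character instead.
char_level_ppl = perplexity(total_log2_prob_word_model, n_chars)

print("per-word perplexity:     ", word_ppl)
print("per-character perplexity:", char_level_ppl)
print("converted from per-word: ", word_ppl_to_char_ppl(word_ppl, n_words, n_chars))
```

The per-character figure is the one to set against the character-based model's perplexity. The comparison is also only fair if OOV words are treated consistently, since a model that skips OOVs or maps them to a single unknown token is scored on an easier task.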
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora