Hello everyone,<BR>  I want to build a Chinese language model from a corpus of about 1.1 GB. My question is whether it is better to model at the character level or at the word level (or at an even higher level, such as phrases). Since the vocabulary of Chinese words is much larger than the vocabulary of characters, a character-based model may need a higher order than a word-based one. I ran an experiment on a smaller corpus, and the perplexity (ppl) of the word-based model was much higher than that of the character-based model, at least partly because the word-based model encounters more OOVs. But if finer granularity is preferable, why don't we model English at the character level rather than the word level?<BR>I would be grateful for any suggestions on this problem.
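
To make the OOV comparison concrete, here is a minimal Python sketch of how the OOV rate at the two granularities could be measured. It assumes the text is already word-segmented with spaces. The helper names (`oov_rate`, `word_tokens`, `char_tokens`) and the toy sentences are invented for illustration and are not the data from the experiment above.

```python
from collections import Counter


def oov_rate(train_tokens, test_tokens, min_count=1):
    """Fraction of test tokens whose type is absent from the training vocabulary."""
    counts = Counter(train_tokens)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def word_tokens(lines):
    # Assumption: each line is already word-segmented, with spaces between words.
    return [w for line in lines for w in line.split()]


def char_tokens(lines):
    # Character level: ignore the segmentation spaces, one token per character.
    return [ch for line in lines for ch in line if not ch.isspace()]


# Toy data (hypothetical). The test word "言语" never occurs in training,
# but both of its characters do.
train = ["我 喜欢 语言 模型", "语言 模型 很 有趣"]
test = ["我 喜欢 言语", "模型 很 有趣"]

word_oov = oov_rate(word_tokens(train), word_tokens(test))
char_oov = oov_rate(char_tokens(train), char_tokens(test))
print(f"word-level OOV rate: {word_oov:.3f}")
print(f"char-level OOV rate: {char_oov:.3f}")
```

On this toy input the unseen word counts as an OOV at the word level but not at the character level, which is the effect described above. Raising `min_count` would mimic a vocabulary cutoff.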