[Corpora-List] A Problem About Chinese Language Modeling

Yannick Versley yannick.versley at unitn.it
Tue Feb 10 12:20:36 UTC 2009


Hello 张坤鹏,

why don't you just build one of each and interpolate the two?
I'm not familiar enough with Chinese to say whether it necessarily makes 
sense linguistically, but the language modeling literature is full of results 
along the lines of "... and we interpolated these two models and got nice 
perplexity improvements", so it might be worth trying.
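For what it's worth, a minimal sketch of linear interpolation in Python (the names p_word, p_char and lam are made up for illustration, not from any toolkit; it assumes both models expose a next-token probability over the same unit inventory, with the weight tuned on held-out data):

    def interpolate_lm(p_word, p_char, lam=0.5):
        # p_word and p_char are hypothetical callables (history, token) -> prob,
        # assumed normalized over the same unit inventory; lam is the
        # interpolation weight, tuned on held-out data (e.g. grid search).
        def p(history, token):
            return lam * p_word(history, token) + (1.0 - lam) * p_char(history, token)
        return p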
If you're aiming for something *simple*, then a character-based model would 
probably be better since you wouldn't have to do word segmentation in the 
first place.
(Note that character n-grams don't carry much meaning in letter-based 
languages; even in German, where you have lengthy synthetic compounds, 
splitting them is a nontrivial task, which is why people traditionally just 
treat those words as a single unit.)
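To illustrate how little machinery the character-based route needs, here is a rough sketch of counting character n-grams (the function name is mine, not from any library): each Chinese character is its own token, so the n-grams are just substrings of length n, and no word segmenter is involved.

    from collections import Counter

    def char_ngram_counts(text, n=3):
        # Every character is its own token, so character n-grams are
        # just the substrings of length n.
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    # e.g. bigram counts over a short string:
    bigrams = char_ngram_counts(u"语言模型很有用", n=2)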

-Y.
> Hello everyone,
>   I want to build a Chinese language model with a corpus of about 1.1 GB. 
> Now I have a question: is it better to count at the character level or at the 
> word level (or at an even higher level, like phrases)? Since the vocabulary 
> of Chinese words is much larger than that of characters, the order of a 
> character-based model can be higher than that of a word-based model. I ran an 
> experiment on a smaller corpus; its results show that the perplexity of the 
> word-based model is much higher than that of the character-based model, at 
> least partly because the first model has more OOVs than the second. 
> But if fine granularity is preferred, why don't we model English at the 
> character level rather than the word level?
> I would be grateful if anyone could give me some suggestions on this problem.
>
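
One caveat about the experiment quoted above: perplexities computed over different unit inventories (words vs. characters) aren't directly comparable, since the two models make different numbers of predictions per sentence. A common workaround is to normalize both models' log-probabilities by the same count, e.g. the number of characters in the test set. A minimal sketch, assuming each model reports a total log-probability for the test set (the function name is made up):

    import math

    def per_char_perplexity(total_logprob, num_chars):
        # total_logprob: natural-log probability a model assigns to the
        # whole test set; num_chars: character count of that same test set.
        # Normalizing by characters puts word- and character-based models
        # on a common footing, since their token counts differ.
        return math.exp(-total_logprob / num_chars)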



