[Corpora-List] A Problem About Chinese Language Modeling
Yannick Versley
yannick.versley at unitn.it
Tue Feb 10 12:20:36 UTC 2009
Hello 张坤鹏,
why don't you just build one of each and make an interpolation of the two?
I'm not familiar enough with Chinese to say whether it necessarily makes
sense linguistically, but the literature in language modeling is full of results
along the lines of "... and we interpolated these two models and got nice
perplexity improvements", so it might be worth trying.
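To be concrete, linear interpolation just means mixing the two probability
estimates with a weight tuned on held-out data. Here is a toy Python sketch
(untested, my own illustration rather than any standard toolkit's API; it
assumes both models already assign a smoothed, nonzero probability to the
same unit, e.g. the character model scores a word as the product of its
characters' probabilities):

import math

def interpolate(p1, p2, lam):
    """Linear interpolation: P(w | h) = lam * P1(w | h) + (1 - lam) * P2(w | h)."""
    return lam * p1 + (1.0 - lam) * p2

def best_lambda(heldout_probs):
    """Grid-search the mixture weight that minimizes held-out perplexity.

    heldout_probs: list of (p1, p2) pairs, one per test token, giving the
    probability each of the two models assigns to that token.
    """
    def ppl(lam):
        log_likelihood = sum(math.log(interpolate(p1, p2, lam))
                             for p1, p2 in heldout_probs)
        return math.exp(-log_likelihood / len(heldout_probs))
    grid = [i / 20.0 for i in range(1, 20)]  # try 0.05, 0.10, ..., 0.95
    return min(grid, key=ppl)

(In practice people often fit the weight with EM rather than a grid, but the
grid version makes the idea obvious.)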
If you're aiming for something *simple*, then a character-based model would
probably be better since you wouldn't have to do word segmentation in the
first place.
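For illustration, counting character n-grams really is just a pass over the
raw text. A minimal, unsmoothed Python sketch (a toy of my own, assuming
one sentence per line; a real model would of course add smoothing on top):

from collections import defaultdict

def char_trigram_counts(lines):
    """Collect character-trigram and bigram-context counts straight from
    raw text -- no word segmentation step at all."""
    tri, bi = defaultdict(int), defaultdict(int)
    for line in lines:
        chars = ["<s>", "<s>"] + list(line.strip()) + ["</s>"]
        for i in range(2, len(chars)):
            tri[(chars[i-2], chars[i-1], chars[i])] += 1
            bi[(chars[i-2], chars[i-1])] += 1
    return tri, bi

def mle_prob(tri, bi, c1, c2, c3):
    """Unsmoothed maximum-likelihood estimate of P(c3 | c1, c2)."""
    return tri[(c1, c2, c3)] / bi[(c1, c2)] if bi[(c1, c2)] else 0.0

The point is simply that the whole segmentation pipeline disappears from
the preprocessing.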
(Note that character n-grams don't carry much meaning in alphabetic
languages; even in German, which has lengthy synthetic compounds,
splitting them is a nontrivial task, which is why people traditionally just
treat those words as single units.)
-Y.
> Hello everyone,
> I want to build a Chinese language model with a corpus of about 1.1 GB.
> Now I have a question: is it better to count on the character level or on
> the word level (or on an even higher level, like phrases)? Since the
> vocabulary of Chinese words is much larger than the character inventory,
> the order of a character-based model can be higher than that of a
> word-based model. I ran an experiment with a smaller corpus, whose result
> shows that the perplexity with the word-based model is much higher than
> with the character-based model, (at least partially) because there are
> more OOVs in the first model than in the second. But if fine granularity
> is preferred, why don't we model English on the character level rather
> than the word level?
> I would be grateful if anyone could give me some suggestions on this problem.
>
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora