[Corpora-List] A Problem About Chinese Language Modeling

Philipp Koehn pkoehn at inf.ed.ac.uk
Tue Feb 10 13:30:32 UTC 2009


Hi,

in machine translation, we see benefits from word segmentation.
One reason is that an n-gram over words covers more context than an
n-gram of the same order over characters. There may also be a problem
with the perplexity numbers you compute: if it is average perplexity
per token (as it is usually measured), then the higher perplexity of
the word model compared to the character model is misleading, because
a word token carries more information than a single character. In that
case, re-expressing the word-based perplexity on a per-character basis
gives a more informative comparison...
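One way to do this is to recover the total log-probability the word
model assigned to the test text and renormalise it by the number of
characters. A minimal sketch in Python, assuming both models were
scored on the same test text; the perplexity and token counts are
made-up numbers for illustration only:

import math

def word_ppl_to_char_ppl(word_ppl, num_words, num_chars):
    # Total log-probability of the test text under the word model,
    # recovered from its per-word perplexity.
    total_log_prob = -num_words * math.log(word_ppl)
    # Renormalise per character so it is comparable to a character model.
    return math.exp(-total_log_prob / num_chars)

# e.g. a word-level perplexity of 600 over 100,000 words / 170,000 characters
print(word_ppl_to_char_ppl(600.0, 100_000, 170_000))   # ~43.1 per character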

-phi

On Tue, Feb 10, 2009 at 10:58 AM, 张坤鹏 <smallfish at mail.nankai.edu.cn> wrote:
> Hello everyone,
>   I want to build a Chinese language model with a corpus of about 1.1 GB.
> Now I have a question: is it better to model at the character level or at
> the word level (or at an even higher level, such as phrases)? Since the
> vocabulary of Chinese words is much larger than that of characters, the
> order of a character-based model can be higher than that of a word-based
> model. I ran an experiment on a smaller corpus, and the result shows that
> the perplexity of the word-based model is much higher than that of the
> character-based model, at least partially because there are more OOVs in
> the first model than in the second. But if fine granularity is preferred,
> why don't we model English at the character level rather than the word
> level? I would be grateful if anyone could give me some suggestions on
> this problem.

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
