[Corpora-List] Question about smoothing of

Yannick Versley yversley at gmail.com
Mon Jun 25 11:32:03 UTC 2012


Dear Coen,

The Java-based Kyoto Language Model library
http://www.phontron.com/kylm/
has an implementation of modKN smoothing that should be good enough to
look at (or actually use), and simple enough for ordinary mortals to
understand.
And yes, you do the full backing-off thing from a trigram model to a
(modified, not actual counts!) bigram model to a (modified!) unigram
model. At the unigram level, you have to use the usual tricks: either
an UNK token for very rare words, a character-based model of new words,
or ...
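
In case a concrete reference helps, here is a minimal sketch of that
back-off recursion in Python. It uses a single discount D (plain
interpolated Kneser-Ney; the modified variant in Chen & Goodman uses
three count-dependent discounts D1, D2, D3+), and all names and data
structures are my own illustration, not KYLM's actual API:

    def kn_prob(ngram, counts, hist_counts, followers,
                discount=0.75, vocab_size=50000):
        """Interpolated Kneser-Ney estimate of P(w | h), ngram = h + (w,).

        counts[n]      : Counter over n-gram tuples; raw counts at the
                         top order, modified (continuation) counts below.
        hist_counts[n] : Counter over the (n-1)-gram histories, i.e. c(h).
        followers[h]   : number of distinct word types seen after h.
        All containers are assumed to return 0 for missing keys.
        """
        n = len(ngram)
        if n == 1:
            # Unigram base case: interpolate with a uniform distribution
            # over an assumed vocabulary size, so that unseen words (or
            # an explicit UNK token) never get probability zero.
            total = sum(counts[1].values())
            reserved = discount * len(counts[1]) / total
            return (max(counts[1][ngram] - discount, 0) / total
                    + reserved / vocab_size)
        h = ngram[:-1]
        c_h = hist_counts[n][h]
        if c_h == 0:
            # Unseen history: nothing to discount, so fall through
            # entirely to the lower-order model.
            return kn_prob(ngram[1:], counts, hist_counts, followers,
                           discount, vocab_size)
        gamma = discount * followers[h] / c_h
        return (max(counts[n][ngram] - discount, 0) / c_h
                + gamma * kn_prob(ngram[1:], counts, hist_counts,
                                  followers, discount, vocab_size))

The c_h == 0 branch is the direct answer to the question below: when the
history was never seen, there is no mass to discount, and the model
should simply use the bigram (and ultimately unigram) estimate instead.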

Best,
Yannick


On Thu, Jun 21, 2012 at 4:20 PM, Coen Jonker <coen.j.jonker at gmail.com> wrote:

> Dear readers of the corpora list,
>
>
> As part of the AI master's course on handwriting recognition, I am working
> on the implementation of a statistical language model for 19th-century
> Dutch. I am running into a problem and hope you may be able to help. I
> have already spoken with Prof. Ernst Wit and he suggested I contact you. I
> would be very grateful if you could help me along.
>
> The purpose of the statistical language model is to provide a
> knowledge-based estimate of the conditional probability of a word w given
> the history h of previous words; call this probability P(w|h).
>
> Since the available corpus for this project is quite sparse, I want to
> apply statistical smoothing to the conditional probabilities. I have
> learned that a simple maximum likelihood estimate of P(w|h) assigns zero
> probability to any word sequence that is not in the corpus, even though
> many such sequences are grammatically correct. Conversely, maximum
> likelihood overestimates the probabilities of the sequences that do occur
> in the corpus.
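>
> As a toy illustration of the zero-probability problem (hypothetical
> counts, not my actual corpus):
>
>     from collections import Counter
>
>     # Tiny hypothetical corpus.
>     corpus = "de oude man en de zee".split()
>     trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
>     bigrams = Counter(zip(corpus, corpus[1:]))
>
>     def mle(w, h):
>         # P(w | h) = c(h + (w,)) / c(h); zero for any unseen trigram.
>         return trigrams[h + (w,)] / bigrams[h] if bigrams[h] else 0.0
>
>     print(mle("zee", ("en", "de")))  # seen once: 1.0 (overestimated)
>     print(mle("man", ("en", "de")))  # unseen but grammatical: 0.0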
>
> There are many smoothing techniques available, but a modified form of
> Kneser-Ney smoothing has empirically proven very effective (I have
> attached a paper by Stanley Chen and Joshua Goodman explaining this). A
> quick intro to the topic is at: http://www.youtube.com/watch?v=ody1ysUTD7o
>
> Kneser-Ney smoothing interpolates discounted trigram probabilities with
> lower-order bigram probabilities. The equations on page 12 (370 in the
> journal numbering) of the attached PDF are the ones I use. The problem I
> run into is that the denominator of the fraction, which is the count of
> the history h in the corpus, may be zero. This causes division errors,
> but it also makes the gamma term zero, yielding zero probabilities.
> Avoiding zero probabilities was one of the reasons to implement smoothing
> in the first place.
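>
> For reference, the interpolated equation in question has this shape (my
> own transcription, with a single discount D; the modified form replaces
> D with three count-dependent discounts):
>
>     P_{KN}(w \mid h) = \frac{\max(c(hw) - D, 0)}{c(h)}
>                        + \gamma(h) \, P_{KN}(w \mid h'), \qquad
>     \gamma(h) = \frac{D \, N_{1+}(h \cdot)}{c(h)}
>
> where h' is h with its first word dropped and N_{1+}(h \cdot) is the
> number of distinct word types observed after h. When c(h) = 0, both the
> first term and \gamma(h) become 0/0, which is exactly where my
> implementation breaks.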
>
> This problem has frustrated me for a few weeks now. After reading most of
> the available literature on the topic, I am afraid that my knowledge of
> language modeling or statistics is insufficient, or that I have
> misunderstood a fundamental part of the technique.
>
> Did I misunderstand anything? I sincerely hope you are able to point me in
> the direction of a solution.
>
> Sincerely,
>
> Coen Jonker
>