Dear Coen,

The Java-based Kyoto Language Model library
http://www.phontron.com/kylm/
has an implementation of modified Kneser-Ney (modKN) smoothing that should be good enough to look at (or actually use), and simple enough for ordinary mortals to understand.

And yes, you do the full backing-off from a trigram model to a bigram model to a unigram model, where the lower-order models are built from modified counts (continuation counts, not actual counts!). At the unigram level, you have to use the usual tricks: either an UNK token for very rare words, or a character-based model of new words, or ...
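If it helps to see the mechanics in one place, here is a rough, self-contained sketch of the bigram case in Java. To be clear: this is not kylm's API, the class and method names are mine, and I use a single fixed discount where modKN would estimate three discounts (D1, D2, D3+) from the count-of-count statistics.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of interpolated Kneser-Ney for bigrams with one fixed
 * discount. Names are invented for illustration; this is NOT kylm's API.
 */
public class KneserNeyBigram {
    private final double d = 0.75;  // fixed discount; modKN uses D1, D2, D3+
    private final Map<String, Integer> historyCounts = new HashMap<>();  // c(u), u seen as a history
    private final Map<String, Integer> bigramCounts = new HashMap<>();   // c(u w)
    private final Map<String, Integer> followerTypes = new HashMap<>();  // N1+(u .)
    private final Map<String, Integer> precederTypes = new HashMap<>();  // N1+(. w)
    private int bigramTypes = 0;                                         // N1+(. .)

    public void train(List<String> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            String u = tokens.get(i - 1), w = tokens.get(i);
            historyCounts.merge(u, 1, Integer::sum);
            if (bigramCounts.merge(u + " " + w, 1, Integer::sum) == 1) {
                // First occurrence of this bigram TYPE: update the
                // continuation ("modified count") statistics.
                followerTypes.merge(u, 1, Integer::sum);
                precederTypes.merge(w, 1, Integer::sum);
                bigramTypes++;
            }
        }
    }

    /** Unigram continuation probability: in how many distinct contexts does w appear? */
    double pContinuation(String w) {
        return precederTypes.getOrDefault(w, 0) / (double) bigramTypes;
    }

    /** Interpolated Kneser-Ney estimate of P(w | u). */
    public double prob(String w, String u) {
        int cu = historyCounts.getOrDefault(u, 0);
        if (cu == 0) {
            // Unseen history: all the mass backs off to the lower order.
            // This is exactly the c(h) = 0 case from the question below.
            return pContinuation(w);
        }
        int cuw = bigramCounts.getOrDefault(u + " " + w, 0);
        double discounted = Math.max(cuw - d, 0.0) / cu;
        double gamma = d * followerTypes.getOrDefault(u, 0) / cu;
        return discounted + gamma * pContinuation(w);
    }
}

The point to notice is the guard on c(u): when the history is unseen, the discounted term and the gamma term are both 0/0, so you do not divide at all but simply return the lower-order distribution. And that lower order is built from continuation counts (in how many distinct contexts a word appears), not raw counts; that is the "modified, not actual counts" point above. A word never seen at all still gets zero from pContinuation, which is why you need the UNK or character-model trick at the unigram level.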
Best,
Yannick
On Thu, Jun 21, 2012 at 4:20 PM, Coen Jonker <coen.j.jonker@gmail.com> wrote:
Dear readers of the corpora list,
As part of the AI master's course on handwriting recognition, I am working on the implementation of a statistical language model for 19th-century Dutch. I am running into a problem and hope you may be able to help. I have already spoken with Prof. Ernst Wit, and he suggested I contact you. I would be very grateful if you could help me along.
The purpose of the statistical language model is to provide a knowledge-based estimate of the conditional probability of a word w given its history h (the previous words); call this probability P(w|h).
Since the available corpus for this project is quite sparse, I want to apply statistical smoothing to the conditional probabilities. I have learned that a simple maximum-likelihood estimate, P_ML(w|h) = c(hw) / c(h), assigns zero probability to any word sequence that is not in the corpus, even though many grammatically correct sequences are missing from it. Conversely, maximum likelihood overestimates the probabilities of the sequences that do occur.
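To make the issue concrete with an invented toy example: if the history h = "de oude" occurs three times in the corpus, followed twice by "man" and once by "vrouw", then P_ML(man|h) = 2/3, P_ML(vrouw|h) = 1/3, and every other word, including perfectly grammatical continuations like "kerk", gets probability exactly 0.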
There are many smoothing techniques available, but empirically a modified form of Kneser-Ney smoothing has proven very effective (I have attached a paper by Stanley Chen and Joshua Goodman explaining this). A quick introduction to the topic is at: http://www.youtube.com/watch?v=ody1ysUTD7o
Kneser-Ney smoothing interpolates discounted trigram probabilities with lower-order bigram probabilities. The equations on page 12 (page 370 in the journal numbering) of the attached PDF are the ones I use. The problem I run into is that the denominator of the fraction, the count of the history h in the corpus, may be zero. That yields division-by-zero errors, and it also makes the gamma term zero, which yields zero probabilities. Avoiding zero probabilities was one of the reasons to implement smoothing in the first place.
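For reference, my reading of that estimate, simplified to a single discount D (the modified version uses three), is:

  P_KN(w|h) = max(c(hw) - D, 0) / c(h) + gamma(h) * P_KN(w|h')

  gamma(h) = D * N1+(h·) / c(h)

where h' is h with its oldest word dropped and N1+(h·) is the number of distinct word types that follow h in the corpus. Both terms divide by c(h), the count of the history, and that is exactly the denominator that can be zero.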
This problem has frustrated me for a few weeks now. After reading most of the available literature on the topic, I am afraid that my knowledge of language modeling or statistics is insufficient, or that I have misunderstood a fundamental part of the technique.
Did I misunderstand anything? I sincerely hope you can point me in the direction of a solution.

Sincerely,

Coen Jonker