<meta http-equiv="content-type" content="text/html; charset=utf-8"><span class="Apple-style-span" style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse;color:rgb(34,34,34)"><div>Dear readers of the corpora list,</div>

<div><br></div><div><br></div><div>As part of the AI master's course Handwriting Recognition, I am implementing a statistical language model for 19th-century Dutch. I have run into a problem and hope you may be able to help. I have already spoken with Prof. Ernst Wit, who suggested that I contact you. I would be very grateful for any help.</div>

<div><br></div><div>The purpose of the statistical language model is to provide a knowledge-based estimate of the conditional probability of a word w given its history h (the preceding words). I denote this probability P(w|h).</div>

<div><br></div><div>Since the data in the available corpus are quite sparse, I want to apply statistical smoothing to the conditional probabilities. A simple maximum likelihood estimate of P(w|h) assigns zero probability to every word sequence that does not occur in the corpus, even though many grammatically correct sequences are absent from it. Furthermore, maximum likelihood overestimates the probabilities of the sequences that do occur in the corpus.</div>
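<div><br></div><div>To illustrate the point, here is a minimal sketch of a maximum likelihood bigram estimate on a toy corpus. The function name <code>mle_bigram_model</code> and the example sentence are made up for illustration and are not taken from my project.</div>

```python
from collections import Counter


def mle_bigram_model(tokens):
    """Return a function p(w, h) giving the maximum likelihood estimate of P(w|h)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Only tokens that are followed by another token can act as a history.
    history_counts = Counter(tokens[:-1])

    def p(w, h):
        c_h = history_counts[h]
        if c_h == 0:
            return None  # undefined: the history h was never observed
        return bigram_counts[(h, w)] / c_h

    return p


# Hypothetical toy corpus (one Dutch sentence).
p = mle_bigram_model("de man ziet de hond".split())

print(p("man", "de"))    # seen bigram: count("de man") / count("de")
print(p("vrouw", "de"))  # "de vrouw" is grammatical but unseen, so the estimate is zero
print(p("ziet", "man"))  # the only continuation of "man" in the corpus gets all the mass
```

<div>The unseen but grammatical bigram receives probability zero, while the single observed continuation of "man" receives probability one, which is surely an overestimate.</div>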

<div><br></div><div>Many smoothing techniques are available, but empirically a modified form of Kneser-Ney smoothing has proven very effective (I have attached a paper by Stanley Chen and Joshua Goodman explaining this). A quick introduction to the topic is available at: <a href="http://www.youtube.com/watch?v=ody1ysUTD7o" target="_blank" style="color:rgb(17,85,204)">http://www.youtube.com/watch?v=ody1ysUTD7o</a></div>

<div><br></div><div>Kneser-Ney smoothing interpolates discounted trigram probabilities with lower-order bigram probabilities. I use the equations on page 12 (page 370 in the journal numbering) of the attached PDF. The problem I run into is that the denominator of the fraction, which is the count of the history h in the corpus, may be zero. This causes errors, and it also makes the gamma term zero, so the model still yields zero probabilities. Avoiding zero probabilities was one of the reasons to implement smoothing in the first place.</div>
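<div><br></div><div>To make the problem concrete, here is a simplified sketch of the computation as I understand it. It is not my actual code: it uses bigrams interpolated with a unigram continuation distribution instead of trigrams with bigrams, and a single assumed discount D = 0.75 instead of the separate discounts of the modified variant. The class and method names are hypothetical.</div>

```python
from collections import Counter, defaultdict


class KneserNeyBigram:
    """Simplified interpolated Kneser-Ney: bigrams backed by continuation unigrams."""

    def __init__(self, tokens, discount=0.75):
        self.D = discount                      # assumed single discount
        self.bigram = Counter(zip(tokens, tokens[1:]))
        self.hist = Counter()                  # c(h): how often h occurs as a history
        self.followers = defaultdict(set)      # distinct words seen after h
        self.preceders = defaultdict(set)      # distinct histories seen before w
        for (h, w), c in self.bigram.items():
            self.hist[h] += c
            self.followers[h].add(w)
            self.preceders[w].add(h)
        self.n_bigram_types = len(self.bigram)

    def p_continuation(self, w):
        # Lower-order estimate: share of bigram types that end in w.
        return len(self.preceders.get(w, ())) / self.n_bigram_types

    def prob(self, w, h):
        c_h = self.hist[h]
        if c_h == 0:
            # This is the case I do not know how to handle.
            raise ZeroDivisionError(f"history {h!r} has count 0")
        discounted = max(self.bigram[(h, w)] - self.D, 0.0) / c_h
        gamma = self.D * len(self.followers.get(h, ())) / c_h
        return discounted + gamma * self.p_continuation(w)


# Hypothetical toy corpus (one Dutch sentence).
model = KneserNeyBigram("de man ziet de hond".split())

print(model.prob("man", "de"))   # seen bigram
print(model.prob("ziet", "de"))  # unseen bigram with a seen history: non-zero
try:
    model.prob("ziet", "hond")   # "hond" never occurs as a history
except ZeroDivisionError as err:
    print("fails:", err)
```

<div>For a history that occurs in the corpus, the probabilities are non-zero and sum to one over the vocabulary. For a history with count zero, both the discounted term and the gamma term divide by zero, which is exactly where I get stuck.</div>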

<div><br></div><div>This problem has frustrated me for a few weeks now. After reading most of the available literature on the topic, I am afraid that my knowledge of language modeling or statistics is insufficient, or that I have misunderstood a fundamental part of the technique.</div>

<div><br></div><div>Did I misunderstand anything? I sincerely hope you can point me toward a solution.</div><div><br></div><div>Sincerely,</div><div><br></div><div>Coen Jonker</div></span>