Yes, I am sure. I have also implemented this method in a Python script; if you want, I can send it to you.<div>The idea is to normalise the number of constrained N-grams by the number of all possible N-grams within your corpus.</div>

<div>In other words, the denominator does not depend at all on the input of the function, so you can compute it once and for all. :)</div><div><br></div><div>Let me know.</div><div><br></div><div>Bye,</div><div>michele.<br>

<br><div class="gmail_quote">On Tue, Jun 26, 2012 at 8:55 AM, Wladimir Sidorenko <span dir="ltr"><<a href="mailto:wlsidorenko@gmail.com" target="_blank">wlsidorenko@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Michele,<br>

<br>

Could you please explain that to me? The denominator of the first<br>

addend is the sum of the counts c(w_{i-n+1}^{i}) over all w_{i}.<br>

I understood that as the number of times the context w_{i-n+1}^{i-1}<br>

preceded any word in the training corpus, and not as the overall number of all<br>

the n-grams in the corpus. To my mind that would be more<br>

reasonable, since if I estimate the probability of the tag pair `VB<br>

TO', I usually divide the number of times I saw the pair `VB TO' by<br>

the number of times I saw the tag `VB,' and not by the number of all<br>

the bigrams I've seen. But maybe that's my mistake. Are you sure that<br>

the denominator means all the possible n-grams (in particular all the<br>

possible bi- and higher order n-grams)? So far I've only seen such a<br>

strategy for the calculation of the unigram probabilities.<br>
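<br>
The two readings of the denominator can be compared on a toy example. The following is only a minimal sketch in Python: the tag sequence is invented for illustration, and the names (history_count, total_bigrams, p_conditional, p_joint) are hypothetical and not taken from Michele's script or from the paper.<br>

```python
from collections import Counter

# Toy tag sequence, invented for illustration only (an assumption).
tags = ["PRP", "VB", "TO", "VB", "DT", "NN", "VB", "TO", "VB"]

# Count every adjacent tag pair in the sequence.
bigram_counts = Counter(zip(tags, tags[1:]))


def history_count(history):
    """Sum of c(history, w) over all w: how often `history` precedes any tag."""
    return sum(c for (h, _), c in bigram_counts.items() if h == history)


pair = ("VB", "TO")

# Reading 1: divide by the count of the history 'VB' (conditional estimate).
p_conditional = bigram_counts[pair] / history_count(pair[0])

# Reading 2: divide by the number of all bigram tokens in the corpus.
# This denominator does not depend on `pair`, so it can be computed once.
total_bigrams = sum(bigram_counts.values())
p_joint = bigram_counts[pair] / total_bigrams

print(p_conditional, p_joint)
```

Under the first reading the estimates for all tags following `VB' sum to 1; under the second they sum to the relative frequency of bigrams that start with `VB'.<br>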

<br>

Kind regards,<br>

Vladimir<br>

<br>

2012/6/26 Michele Filannino <<a href="mailto:michele.filannino@cs.manchester.ac.uk">michele.filannino@cs.manchester.ac.uk</a>>:<br>

<div class="HOEnZb"><div class="h5">> Hi Coen,<br>

><br>

> look carefully at the denominator of that formula (page 370). You will<br>

> easily spot that it refers to all the possible N-grams in your corpus, not<br>

> just those constrained according to a particular w_i. The meaning of that<br>

> denominator is just a counter of all the possible N-grams present in your<br>

> corpus. If you correctly interpret that counter you will easily understand<br>

> that your denominator cannot be equal to 0 (except in the case you do not<br>

> have any N-gram in it).<br>

><br>

> Let me know if you have understood your mistake.<br>

><br>

> Bye,<br>

> michele.<br>

><br>

> On Thu, Jun 21, 2012 at 3:20 PM, Coen Jonker <<a href="mailto:coen.j.jonker@gmail.com">coen.j.jonker@gmail.com</a>><br>

> wrote:<br>

>><br>

>> Dear readers of the corpora list,<br>

>><br>

>><br>

>> As a part of the AI-master course handwriting recognition I am working on<br>

>> the implementation of a Statistical Language Model for 19th century Dutch. I<br>

>> am running into a problem and hope you may be able to help. I have already<br>

>> spoken with prof. Ernst Wit and he suggested I contact you. I would be<br>

>> very grateful if you could help me along.<br>

>><br>

>> The purpose of the statistical language model is to provide a<br>

>> knowledge-based estimation for the conditional probability of a word w given<br>

>> the history h (previous words), let this probability be P(w|h).<br>

>><br>

>> Since the available corpus for this project is quite sparse I want to use<br>

>> statistical smoothing on the conditional probabilities. I have learned that<br>

>> using a simple maximum likelihood estimation for P(w|h) will yield zero<br>

>> probabilities for word sequences that are not in the corpus, even though<br>

>> many grammatically correct sequences are not in the corpus. Furthermore, the<br>

>> actual probabilities for P(w|h) will be overestimated by maximum likelihood.<br>

>><br>

>> There are many smoothing techniques available, but empirically a modified<br>

>> form of Kneser-Ney smoothing has been proven very effective (I have attached<br>

>> a paper by Stanley Chen and Joshua Goodman explaining this). A quick intro<br>

>> on the topic is on: <a href="http://www.youtube.com/watch?v=ody1ysUTD7o" target="_blank">http://www.youtube.com/watch?v=ody1ysUTD7o</a><br>

>><br>

>> The Kneser-Ney smoothing interpolates discounted probabilities for<br>

>> trigrams with lower order bigram probabilities. The equations on page 12<br>

>> (370 in the journal numbering) of the attached PDF are the ones I use. The<br>

>> problem I run into is that the denominator of the fraction, which is the<br>

>> count of the history h in the corpus may be zero, yielding errors, but also<br>

>> making the gamma-term zero, yielding zero-probabilities. Avoiding zero<br>

>> probabilities was one of the reasons to implement smoothing in the first<br>

>> place.<br>

>><br>

>> This problem has frustrated me for a few weeks now. After reading most of<br>

>> the available literature on the topic, I am afraid that my knowledge of<br>

>> language modeling or statistics may be insufficient or that I misunderstood<br>

>> a fundamental part of the technique.<br>

>><br>

>> Did I misunderstand anything? I sincerely hope you are able to point me in<br>

>> the direction of a solution.<br>

>><br>

>> Sincerely,<br>

>><br>

>> Coen Jonker<br>

>><br>

>> _______________________________________________<br>

>> UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

>> Corpora mailing list<br>

>> <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

>> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

>><br>

><br>

><br>

><br>

> --<br>

> Michele Filannino<br>

><br>

> CDT PhD student in Computer Science<br>

> Room IT301 - IT Building<br>

> The University of Manchester<br>

> <a href="mailto:filannim@cs.manchester.ac.uk">filannim@cs.manchester.ac.uk</a><br>

><br>


><br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Michele Filannino<br><br><font color="#666666">CDT PhD student in Computer Science<br>Room IT301 - IT Building<br>The University of Manchester<br><a href="mailto:filannim@cs.manchester.ac.uk" target="_blank">filannim@cs.manchester.ac.uk</a></font><br>


</div>