Yes, I am sure. I also implemented this method in a Python script; if you want, I could send it to you. The idea is to normalise the number of constrained N-grams by the number of all possible N-grams in your corpus.

In other words, the denominator does not depend at all on the input of the function, so you can compute it once and for all. :)

Let me know.

Bye,
michele.
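P.S. A minimal sketch of the idea (the names here are illustrative; this is not the actual script):

    from collections import Counter

    def ngram_counts(tokens, n):
        # Count every n-gram token of length n in the corpus.
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    def total_ngrams(tokens, n):
        # The denominator: the number of all n-gram tokens in the corpus.
        # It depends only on the corpus and n, so compute it once and reuse it.
        return max(len(tokens) - n + 1, 0)

    tokens = "the cat sat on the mat".split()
    counts = ngram_counts(tokens, 2)
    denom = total_ngrams(tokens, 2)        # 5, computed once for all queries
    print(counts[("the", "cat")] / denom)  # normalised count: 1/5 = 0.2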
On Tue, Jun 26, 2012 at 8:55 AM, Wladimir Sidorenko <wlsidorenko@gmail.com> wrote:
Hi Michele,

Could you please explain that for me? The denominator of the first
addend is the sum of the counts c(w_{i-n+1}^{i}) over all w_{i}. I
understood that as the number of times the context w_{i-n+1}^{i-1}
preceded any word in the training corpus, not as the overall number of
n-grams in the corpus. To my mind that is more reasonable: if I
estimate the probability of the tag pair `VB TO', I usually divide the
number of times I saw the pair `VB TO' by the number of times I saw the
tag `VB', and not by the number of all the bigrams I have seen. But
maybe that is my mistake. Are you sure that the denominator means all
the possible n-grams (in particular, all the possible bi- and
higher-order n-grams)? So far I have only seen such a strategy for the
calculation of unigram probabilities.
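
For concreteness, suppose that in a toy corpus c(VB TO) = 50, c(VB) =
200, and there are 10,000 bigram tokens in total. The estimate I have
in mind is P(TO|VB) = 50/200 = 0.25, whereas dividing by all bigrams
would give 50/10,000 = 0.005, and those values would no longer sum to
one over the possible successors of VB.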

Kind regards,
Vladimir

2012/6/26 Michele Filannino <michele.filannino@cs.manchester.ac.uk>:
> Hi Coen,
>
> look carefully at the denominator of that formula (page 370). You will
> easily spot that it refers to all the possible N-grams in your corpus, not
> just the one constrained according to a particular w_i. That denominator is
> just a count of all the N-grams present in your corpus. If you interpret
> that count correctly, you will easily see that your denominator cannot be
> equal to 0 (except in the case where you do not have any N-grams at all).
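>
> A tiny example: for the corpus "a b a b c" there are four bigram tokens
> in total (a b, b a, a b, b c), so that denominator is 4 whichever w_i
> you condition on; it can only be 0 when the corpus contains no N-grams.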
>
> Let me know if you have understood your mistake.
>
> Bye,
> michele.
>
> On Thu, Jun 21, 2012 at 3:20 PM, Coen Jonker <coen.j.jonker@gmail.com>
> wrote:
>>
>> Dear readers of the corpora list,
>>
>> As part of the AI master's course on handwriting recognition, I am working
>> on the implementation of a statistical language model for 19th-century
>> Dutch. I am running into a problem and hope you may be able to help. I have
>> already spoken with Prof. Ernst Wit, and he suggested I contact you. I
>> would be very grateful if you could help me along.
>>
>> The purpose of the statistical language model is to provide a
>> knowledge-based estimate of the conditional probability of a word w given
>> the history h (the previous words); call this probability P(w|h).
>>
>> Since the available corpus for this project is quite sparse, I want to use
>> statistical smoothing on the conditional probabilities. I have learned that
>> a simple maximum likelihood estimate of P(w|h) will yield zero
>> probabilities for word sequences that do not occur in the corpus, even
>> though many grammatically correct sequences are absent from it.
>> Furthermore, maximum likelihood will overestimate the actual probabilities
>> P(w|h) of the sequences that do occur.
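>>
>> To make the zero-probability problem concrete, here is a minimal
>> maximum-likelihood bigram estimator (a sketch with made-up data, not my
>> actual code):
>>
>>     from collections import Counter
>>
>>     tokens = "de kat zat op de mat".split()
>>     bigrams = Counter(zip(tokens, tokens[1:]))
>>     unigrams = Counter(tokens)
>>
>>     def p_ml(w, h):
>>         # Maximum likelihood: P(w|h) = c(h, w) / c(h).
>>         return bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0
>>
>>     print(p_ml("mat", "de"))   # 0.5 -- the pair was seen
>>     print(p_ml("hond", "de"))  # 0.0 -- unseen, though perfectly grammatical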
>>
>> There are many smoothing techniques available, but empirically a modified
>> form of Kneser-Ney smoothing has proven very effective (I have attached a
>> paper by Stanley Chen and Joshua Goodman explaining this). A quick intro to
>> the topic is at: http://www.youtube.com/watch?v=ody1ysUTD7o
>>
>> Kneser-Ney smoothing interpolates discounted trigram probabilities with
>> lower-order bigram probabilities. The equations on page 12 (370 in the
>> journal numbering) of the attached PDF are the ones I use. The problem I
>> run into is that the denominator of the fraction, which is the count of
>> the history h in the corpus, may be zero. This not only yields division
>> errors but also makes the gamma term zero, which in turn yields zero
>> probabilities; avoiding zero probabilities was one of the reasons to
>> implement smoothing in the first place.
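>>
>> To illustrate where it breaks, here is a sketch of the interpolated
>> bigram case (the variable names are mine, not the paper's; D is the
>> discount, and the fallback for an unseen history is only my guess):
>>
>>     def p_continuation(w, bigrams):
>>         # Lower-order Kneser-Ney term: the number of distinct histories
>>         # preceding w, over the number of distinct bigram types.
>>         preceding = len({h for (h, v) in bigrams if v == w})
>>         return preceding / len(bigrams)
>>
>>     def p_kn(w, h, bigrams, D=0.75):
>>         # c(h): how often the history h occurs, i.e. the sum of the
>>         # counts of all bigrams starting with h.
>>         c_h = sum(c for (h2, _), c in bigrams.items() if h2 == h)
>>         if c_h == 0:
>>             # This is exactly my problem: for an unseen history the
>>             # denominator c(h) is zero and gamma collapses to zero.
>>             # Backing off entirely to the lower-order term is one option:
>>             return p_continuation(w, bigrams)
>>         c_hw = bigrams.get((h, w), 0)
>>         successors = len({v for (h2, v) in bigrams if h2 == h})
>>         gamma = D * successors / c_h
>>         return max(c_hw - D, 0.0) / c_h + gamma * p_continuation(w, bigrams)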
>>
>> This problem has frustrated me for a few weeks now. After reading most of
>> the available literature on the topic, I am afraid that my knowledge of
>> language modelling or statistics may be insufficient, or that I have
>> misunderstood a fundamental part of the technique.
>>
>> Did I misunderstand anything? I sincerely hope you are able to point me in
>> the direction of a solution.
>>
>> Sincerely,
>>
>> Coen Jonker
>>
>
> --
> Michele Filannino
>
> CDT PhD student in Computer Science
> Room IT301 - IT Building
> The University of Manchester
> filannim@cs.manchester.ac.uk

-- 
Michele Filannino

CDT PhD student in Computer Science
Room IT301 - IT Building
The University of Manchester
filannim@cs.manchester.ac.uk