[Corpora-List] Question about smoothing of

Wladimir Sidorenko wlsidorenko at gmail.com
Tue Jun 26 07:55:42 UTC 2012


Hi Michele,

Could you please explain that to me? The denominator of the first
addend is the sum of the counts c(w_{i-n+1}^{i}) over all w_{i}.
I understood that as the number of times the context w_{i-n+1}^{i-1}
preceded any word in the training corpus, not the overall number of
n-grams in the corpus. To my mind that is more reasonable: if I
estimate the probability of the tag pair `VB TO', I usually divide
the number of times I saw the pair `VB TO' by the number of times I
saw the tag `VB', not by the number of all the bigrams I've seen.
But maybe that's my mistake. Are you sure that the denominator means
all the possible n-grams (in particular all the possible bi- and
higher-order n-grams)? So far I've only seen such a strategy for the
calculation of unigram probabilities.
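
To make the two readings of the denominator concrete, here is a minimal
Python sketch on a toy tag sequence. Everything in it is an assumption made
for illustration: the tag sequence, the function names, the fixed discount
of 0.75, and the choice to fall back to the lower-order distribution when a
history was never seen. It is restricted to bigrams and is not taken from
the paper.

```python
from collections import Counter

# Toy tag sequence, invented purely for illustration.
tags = ["PRP", "VB", "TO", "VB", "DT", "NN", "VB", "TO", "VB", "NN"]
vocab = set(tags)

# c(h, w): how often tag w directly follows tag h.
bigrams = Counter(zip(tags, tags[1:]))
total_bigrams = sum(bigrams.values())


def history_count(history):
    """Sum of c(history, w) over all w: the denominator in question."""
    return sum(c for (h, _), c in bigrams.items() if h == history)


def p_mle(word, history):
    """Maximum likelihood estimate c(history, word) / sum_w c(history, w).

    Raises ZeroDivisionError for a history that never occurs.
    """
    return bigrams[(history, word)] / history_count(history)


def p_continuation(word):
    """Lower-order estimate: share of bigram types that end in word."""
    return sum(1 for (_, w) in bigrams if w == word) / len(bigrams)


def p_kn(word, history, discount=0.75):
    """Interpolated Kneser-Ney for bigrams with a single fixed discount."""
    denom = history_count(history)
    if denom == 0:
        # Assumption of this sketch: an unseen history carries no
        # evidence, so the lower-order distribution is used on its own.
        return p_continuation(word)
    followers = sum(1 for (h, _) in bigrams if h == history)
    gamma = discount * followers / denom
    discounted = max(bigrams[(history, word)] - discount, 0) / denom
    return discounted + gamma * p_continuation(word)


print(p_mle("TO", "VB"))                      # 2 / 4
print(bigrams[("VB", "TO")] / total_bigrams)  # 2 / 9, not conditional on VB
print(sum(p_kn(w, "VB") for w in vocab))      # close to 1
print(sum(p_kn(w, "XX") for w in vocab))      # unseen history, close to 1
```

With the history count as denominator, the estimates for a given history
sum to one over the vocabulary. Dividing by the total number of bigrams
does not give a conditional distribution for that history.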

Kind regards,
Vladimir

2012/6/26 Michele Filannino <michele.filannino at cs.manchester.ac.uk>:
> Hi Coen,
>
> Look carefully at the denominator of that formula (page 370). You will
> easily spot that it refers to all the possible N-grams in your corpus, not
> just the ones constrained by a particular w_i. That denominator is simply
> a count of all the N-grams present in your corpus. If you interpret that
> count correctly, you will see that your denominator cannot be equal to 0
> (except when your corpus contains no N-grams at all).
>
> Let me know if you have understood your mistake.
>
> Bye,
> michele.
>
> On Thu, Jun 21, 2012 at 3:20 PM, Coen Jonker <coen.j.jonker at gmail.com>
> wrote:
>>
>> Dear readers of the corpora list,
>>
>>
>> As part of the AI master's course on handwriting recognition, I am working
>> on the implementation of a Statistical Language Model for 19th century
>> Dutch. I am running into a problem and hope you may be able to help. I have
>> already spoken with prof. Ernst Wit and he suggested I contact you. I would
>> be very grateful if you could help me along.
>>
>> The purpose of the statistical language model is to provide a
>> knowledge-based estimate of the conditional probability of a word w given
>> the history h (the previous words). Let this probability be P(w|h).
>>
>> Since the available corpus for this project is quite sparse, I want to use
>> statistical smoothing on the conditional probabilities. I have learned that
>> using a simple maximum likelihood estimate for P(w|h) will yield zero
>> probabilities for word sequences that are not in the corpus, even though
>> many grammatically correct sequences are not in the corpus. Furthermore,
>> maximum likelihood will overestimate P(w|h) for the sequences that do occur
>> in the corpus.
>>
>> There are many smoothing techniques available, but empirically a modified
>> form of Kneser-Ney smoothing has proven very effective (I have attached
>> a paper by Stanley Chen and Joshua Goodman explaining this). A quick intro
>> to the topic is at: http://www.youtube.com/watch?v=ody1ysUTD7o
>>
>> Kneser-Ney smoothing interpolates discounted probabilities for
>> trigrams with lower-order bigram probabilities. The equations on page 12
>> (370 in the journal numbering) of the attached PDF are the ones I use. The
>> problem I run into is that the denominator of the fraction, which is the
>> count of the history h in the corpus, may be zero. This yields errors and
>> also makes the gamma term zero, which yields zero probabilities. Avoiding
>> zero probabilities was one of the reasons to implement smoothing in the
>> first place.
>>
>> This problem has frustrated me for a few weeks now. After reading most of
>> the available literature on the topic, I am afraid that my knowledge of
>> language modeling or statistics may be insufficient, or that I have
>> misunderstood a fundamental part of the technique.
>>
>> Did I misunderstand anything? I sincerely hope you are able to point me in
>> the direction of a solution.
>>
>> Sincerely,
>>
>> Coen Jonker
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>
>
>
> --
> Michele Filannino
>
> CDT PhD student in Computer Science
> Room IT301 - IT Building
> The University of Manchester
> filannim at cs.manchester.ac.uk
>
