[Corpora-List] Question about smoothing of

Michele Filannino michele.filannino at cs.manchester.ac.uk
Tue Jun 26 10:02:44 UTC 2012


Yes, I am sure. I have also implemented this method in a Python script; if
you want, I can send it to you.
The idea is to normalise the number of constrained N-grams by the number of
all possible N-grams in your corpus.
In other words, the denominator does not depend at all on the input of the
function, so you can compute it once and for all. :)
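
Roughly, the idea could be sketched in Python like this (only an
illustration, not the actual script; the function names and the toy corpus
are made up):

from collections import Counter

def ngram_counts(tokens, n):
    # Count every n-gram of order n in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def normalised_probability(ngram, counts, total):
    # Normalise the count of one constrained n-gram by the total number
    # of n-grams in the corpus; the total is the same for every input.
    return counts[tuple(ngram)] / total

tokens = "the cat sat on the mat".split()
bigrams = ngram_counts(tokens, 2)
total_bigrams = sum(bigrams.values())  # the denominator, computed only once
print(normalised_probability(("the", "cat"), bigrams, total_bigrams))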

Let me know.

Bye,
michele.

On Tue, Jun 26, 2012 at 8:55 AM, Wladimir Sidorenko
<wlsidorenko at gmail.com> wrote:

> Hi Michele,
>
> Could you please explain that for me? The denominator of the first
> addend is the sum of the counts c(w_{i-n+1}^{i}) over all w_{i}. I
> understood that as the number of times the context w_{i-n+1}^{i-1}
> preceded any word in the training corpus, not as the overall number of
> n-grams in the corpus. To my mind that would be more reasonable: if I
> estimate the probability of the tag pair `VB TO', I usually divide the
> number of times I saw the pair `VB TO' by the number of times I saw the
> tag `VB', not by the number of all the bigrams I've seen. But maybe
> that's my mistake. Are you sure that the denominator means all the
> possible n-grams (in particular, all the possible bi- and higher-order
> n-grams)? So far I've only seen such a strategy for the calculation of
> the unigram probabilities.
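>
> To put the two readings side by side (just restating my example in
> formulas): under the first reading,
>
>   P(TO | VB) = c(VB TO) / c(VB),
>
> whereas under the second,
>
>   P(TO | VB) = c(VB TO) / (total number of bigrams in the corpus).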
>
> Kind regards,
> Vladimir
>
> 2012/6/26 Michele Filannino <michele.filannino at cs.manchester.ac.uk>:
> > Hi Coen,
> >
> > look carefully at the denominator of that formula (page 370). You will
> > easily spot that it refers to all the possible N-grams in your corpus,
> > not just the one constrained by a particular w_i. That denominator is
> > simply a count of all the N-grams present in your corpus. If you
> > interpret that count correctly, you will easily see that your
> > denominator cannot be equal to 0 (except in the case where your corpus
> > does not contain any N-grams at all).
> >
> > Let me know if you have understood your mistake.
> >
> > Bye,
> > michele.
> >
> > On Thu, Jun 21, 2012 at 3:20 PM, Coen Jonker <coen.j.jonker at gmail.com>
> > wrote:
> >>
> >> Dear readers of the corpora list,
> >>
> >>
> >> As part of the AI master's course on handwriting recognition, I am
> >> working on the implementation of a statistical language model for
> >> 19th-century Dutch. I am running into a problem and hope you may be
> >> able to help. I have already spoken with prof. Ernst Wit and he
> >> suggested I contact you. I would be very grateful if you could help me
> >> along.
> >>
> >> The purpose of the statistical language model is to provide a
> >> knowledge-based estimate of the conditional probability of a word w
> >> given the history h (the previous words); let this probability be
> >> P(w|h).
> >>
> >> Since the available corpus for this project is quite sparse, I want to
> >> apply statistical smoothing to the conditional probabilities. I have
> >> learned that a simple maximum likelihood estimate of P(w|h) will assign
> >> zero probability to word sequences that do not occur in the corpus,
> >> even though many grammatically correct sequences are absent from it.
> >> Furthermore, maximum likelihood will overestimate the probabilities
> >> P(w|h) of the sequences that do occur.
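> >>
> >> (Concretely, by maximum likelihood I mean the estimate
> >> P_ML(w|h) = c(h w) / c(h), which is zero whenever the sequence h w is
> >> unseen, and undefined when the history h itself never occurs.)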
> >>
> >> There are many smoothing techniques available, but empirically a
> >> modified form of Kneser-Ney smoothing has proven very effective (I have
> >> attached a paper by Stanley Chen and Joshua Goodman explaining this). A
> >> quick intro to the topic is at: http://www.youtube.com/watch?v=ody1ysUTD7o
> >>
> >> Kneser-Ney smoothing interpolates discounted trigram probabilities
> >> with lower-order bigram probabilities. The equations on page 12 (370 in
> >> the journal numbering) of the attached PDF are the ones I use. The
> >> problem I run into is that the denominator of the fraction, which is
> >> the count of the history h in the corpus, may be zero. This yields
> >> errors, but it also makes the gamma term zero, which in turn yields
> >> zero probabilities. Avoiding zero probabilities was one of the reasons
> >> to implement smoothing in the first place.
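> >>
> >> For reference, the interpolated equation in question has (as far as I
> >> can reproduce it in plain text here) the form
> >>
> >>   p_KN(w_i | w_{i-n+1}^{i-1})
> >>     = max(c(w_{i-n+1}^{i}) - D, 0) / sum_{w_i} c(w_{i-n+1}^{i})
> >>       + gamma(w_{i-n+1}^{i-1}) * p_KN(w_i | w_{i-n+2}^{i-1}),
> >>
> >> with gamma(w_{i-n+1}^{i-1}) = D * N_{1+}(w_{i-n+1}^{i-1} •) /
> >> sum_{w_i} c(w_{i-n+1}^{i}). It is this shared denominator, the count of
> >> the history, that can be zero when the history never occurs.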
> >>
> >> This problem has frustrated me for a few weeks now. After reading most
> >> of the available literature on the topic, I am afraid that my knowledge
> >> of language modeling or statistics may be insufficient, or that I have
> >> misunderstood a fundamental part of the technique.
> >>
> >> Did I misunderstand anything? I sincerely hope you are able to point me
> >> in the direction of a solution.
> >>
> >> Sincerely,
> >>
> >> Coen Jonker
> >>
> >
> >
> >
> > --
> > Michele Filannino
> >
> > CDT PhD student in Computer Science
> > Room IT301 - IT Building
> > The University of Manchester
> > filannim at cs.manchester.ac.uk
> >
>



-- 
Michele Filannino

CDT PhD student in Computer Science
Room IT301 - IT Building
The University of Manchester
filannim at cs.manchester.ac.uk

