[Corpora-List] Question about smoothing of language models

Coen Jonker coen.j.jonker at gmail.com
Thu Jun 21 14:20:31 UTC 2012


Dear readers of the corpora list,


As part of the AI master's course on handwriting recognition, I am working on
the implementation of a statistical language model for 19th-century Dutch.
I have run into a problem and hope you may be able to help. I have
already spoken with prof. Ernst Wit, and he suggested I contact you. I
would be very grateful if you could help me along.

The purpose of the statistical language model is to provide a
knowledge-based estimate of the conditional probability of a word w
given its history h (the previous words); call this probability P(w|h).

Since the available corpus for this project is quite sparse, I want to apply
statistical smoothing to the conditional probabilities. A simple
maximum-likelihood estimate of P(w|h) assigns zero probability to any word
sequence that does not occur in the corpus, even though many grammatically
correct sequences are missing from it. Conversely, maximum likelihood
overestimates the probabilities of the sequences that do occur.
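
To make the zero-probability problem concrete, here is a minimal Python
sketch (the toy corpus and function names are my own illustration, not part
of the actual model):

    from collections import Counter

    corpus = "de kat zat op de mat".split()   # toy stand-in for the real corpus
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p_ml(word, history):
        """Maximum-likelihood estimate of P(word | history) for a bigram model."""
        if unigrams[history] == 0:
            return 0.0                        # history never observed at all
        return bigrams[(history, word)] / unigrams[history]

    print(p_ml("mat", "de"))    # 0.5 -- the bigram "de mat" occurs in the corpus
    print(p_ml("hond", "de"))   # 0.0 -- unseen, although "de hond" is perfectly grammatical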

There are many smoothing techniques available, but empirically a modified
form of Kneser-Ney smoothing has proven very effective (I have attached a
paper by Stanley Chen and Joshua Goodman that explains this). A quick
introduction to the topic is at: http://www.youtube.com/watch?v=ody1ysUTD7o

Kneser-Ney smoothing interpolates discounted trigram probabilities with
lower-order bigram probabilities. The equations on page 12 (page 370 in the
journal numbering) of the attached PDF are the ones I use. The problem I run
into is that the denominator of the fraction, the count of the history h in
the corpus, may be zero. This yields division-by-zero errors, and it also
makes the gamma term zero, producing zero probabilities. Avoiding zero
probabilities was one of the reasons to implement smoothing in the first place.
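
To show more precisely where I get stuck, here is a rough Python sketch of
the interpolated trigram estimate as I understand it, using a single absolute
discount D for simplicity (the modified version in the paper uses three
discounts). The helper names are my own, and the guard that simply backs off
to the lower-order distribution when the history count is zero is a
workaround I am not sure is correct:

    from collections import Counter

    D = 0.75  # absolute discount

    def p_kn_trigram(w, h, trigram_counts, bigram_counts, p_lower):
        """Interpolated Kneser-Ney style estimate of P(w | h) for history h = (u, v)."""
        h_count = bigram_counts[h]              # count of the history (u, v)
        if h_count == 0:
            # History never observed: the discounted term and gamma are both
            # undefined, so back off entirely to the lower-order estimate.
            # (Is this the right thing to do?)
            return p_lower(w, h[1])

        # Discounted relative frequency of the full trigram.
        discounted = max(trigram_counts[(h[0], h[1], w)] - D, 0.0) / h_count

        # Gamma: the discount mass, redistributed over the lower-order distribution.
        distinct_continuations = sum(1 for (u, v, _) in trigram_counts if (u, v) == h)
        gamma = D * distinct_continuations / h_count

        return discounted + gamma * p_lower(w, h[1])

    # Toy usage with made-up data:
    words = "de kat zat op de mat".split()
    tri = Counter(zip(words, words[1:], words[2:]))
    bi = Counter(zip(words, words[1:]))
    uni = Counter(words)

    def p_lower(w, v):
        # crude add-one smoothed bigram, only a stand-in for the real
        # lower-order Kneser-Ney distribution
        return (bi[(v, w)] + 1) / (uni[v] + len(uni))

    print(p_kn_trigram("mat", ("zat", "op"), tri, bi, p_lower))   # observed history
    print(p_kn_trigram("mat", ("op", "hond"), tri, bi, p_lower))  # unseen history -> fallback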

This problem has frustrated me for a few weeks now. After reading most of
the available literature on the topic, I am afraid that my knowledge of
language modeling or statistics may be insufficient, or that I have
misunderstood a fundamental part of the technique.

Did I misunderstand anything? I sincerely hope you are able to point me in
the direction of a solution.

Sincerely,

Coen Jonker
-------------- next part --------------
A non-text attachment was scrubbed...
Name: chen-goodman-99.pdf
Type: application/pdf
Size: 656283 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20120621/00971ba2/attachment-0001.pdf>