[Corpora-List] threshold for hapax legomena
Martin Reynaert
reynaert at uvt.nl
Fri May 25 08:18:37 UTC 2007
Dear Tadeusz,
I cannot claim that what I have to offer is a reliable way, but it is at
least a more informed way of calculating thresholds.
In my PhD work on Text-Induced Spelling Correction (TISC) I have
experimented with what I call 'Zipf filters'. These set the threshold
per word length, based on the occurrences actually observed in the
particular corpus. The simple observation behind them is that one should
expect to encounter shorter words far more often than longer ones.
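To make this concrete: a rough sketch of the idea in Python (not the
actual method from the dissertation or from CICCL; the mean per-length
frequency and the 'scale' factor are merely illustrative choices of my
own) might look like this:

    from collections import Counter, defaultdict

    def zipf_filter_thresholds(tokens, scale=0.5):
        # Frequency of every type in the corpus.
        freq = Counter(tokens)
        # Group the type frequencies by word length.
        counts_by_length = defaultdict(list)
        for word, count in freq.items():
            counts_by_length[len(word)].append(count)
        # Threshold per length: a fraction of the mean observed
        # frequency of types of that length. Shorter words occur
        # far more often, so their threshold comes out higher.
        thresholds = {}
        for length, counts in counts_by_length.items():
            mean = sum(counts) / len(counts)
            thresholds[length] = max(1, int(scale * mean))
        return thresholds

    def significant_types(tokens, scale=0.5):
        # Keep only the types that meet the threshold for their
        # own word length.
        freq = Counter(tokens)
        thresholds = zipf_filter_thresholds(tokens, scale)
        return {w for w, c in freq.items() if c >= thresholds[len(w)]}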
I believe I agree with what Eric writes about the single statistical
metric. The Zipf filters allow you to set the thresholds higher or
lower depending on the application, and so to opt for either higher
precision or higher recall.
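For instance, with the sketch above (again, the 'scale' factor and the
corpus file name are only illustrative):

    tokens = open('corpus.txt').read().split()

    # A higher scale keeps only well-attested types (precision);
    # a lower scale also admits rarer types (recall).
    strict = significant_types(tokens, scale=1.0)
    lenient = significant_types(tokens, scale=0.1)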
My dissertation is online at: http://ilk.uvt.nl/~mre/
I also have a Perl implementation of this in a script called CICCL
(Corpus-Induced Corpus Clean-up), which I have recently released under
the GNU General Public Licence.
Yours,
Martin Reynaert
Postdoc Researcher
Induction of Linguistic Knowledge
Tilburg University
The Netherlands
TadPiotr wrote:
> Dear All,
> Is there any reliable way of calculating the threshold for hapax legomena in
> a corpus of a given size? By a threshold I mean the number of occurrences
> (tokens) below which any types will be treated as statistically
> insignificant. There is a common belief that below 10 occurrences what we
> have are hapax legomena, but it seems to me that this should vary with the
> size of the corpus and should therefore be calculable.
> Any help will be appreciated.
> Yours,
> Tadeusz Piotrowski
> Professor in Linguistics
> Opole University
> Poland