[Corpora-List] threshold for hapax legomena

Martin Reynaert reynaert at uvt.nl
Fri May 25 08:18:37 UTC 2007


Dear Tadeusz,

I cannot claim that what I have to offer is a reliable way, but it is at 
least a more informed way of calculating thresholds.

In my PhD work on Text-Induced Spelling Correction (TISC) I have 
experimented with what I call 'Zipf filters'. These set the threshold 
per word length, based on the occurrences observed within the 
particular corpus. The simple underlying observation is that one 
should expect to encounter shorter words far more often than longer ones.
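To make the idea concrete, here is a minimal sketch in Python (not my actual CICCL Perl code, which is linked below) of one possible way to derive per-word-length thresholds from a corpus. The `factor` parameter and the use of the mean token frequency per length are illustrative assumptions, not the scheme from the dissertation:

```python
from collections import Counter, defaultdict

def zipf_filter_thresholds(tokens, factor=0.01):
    """Illustrative per-word-length frequency thresholds ('Zipf filter').

    Short word lengths, which carry most of the token mass, end up with
    higher thresholds; long, sparse word lengths fall back to 1.
    The 'factor' knob is a hypothetical parameter for trading precision
    against recall."""
    freqs = Counter(tokens)
    mass_by_len = defaultdict(int)   # total token count per word length
    types_by_len = defaultdict(int)  # number of distinct types per word length
    for word, n in freqs.items():
        mass_by_len[len(word)] += n
        types_by_len[len(word)] += 1
    # Threshold: a small fraction of the mean token frequency at that
    # length, but never below a single occurrence.
    return {
        length: max(1, round(factor * mass_by_len[length] / types_by_len[length]))
        for length in mass_by_len
    }

def is_significant(word, freqs, thresholds):
    """A type counts as significant if it meets its length's threshold."""
    return freqs[word] >= thresholds.get(len(word), 1)
```

Raising `factor` favours precision (more rare types filtered out); lowering it favours recall.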

I believe I agree with what Eric writes about the single statistical 
metric. The Zipf filters allow you to set the thresholds higher or lower 
depending on the application and also allow you to either opt for higher 
precision or higher recall.

My dissertation is online at: http://ilk.uvt.nl/~mre/

I also have a Perl implementation of this in a script called CICCL 
(Corpus-Induced Corpus Clean-up), which I have recently put under the 
GNU General Public License.

Yours,

Martin Reynaert
Postdoc Researcher
Induction of Linguistic Knowledge
Tilburg University
The Netherlands

TadPiotr wrote:
> Dear All,
> Is there any reliable way of calculating the threshold for hapax legomena in
> a corpus of a given size? By a threshold I mean the number of occurrences
> (tokens) below which any types will be treated as statistically
> insignificant. There is a common belief that below 10 occurrences what we
> have is hapax legomena, but that should differ with respect to the size of
> the corpus, it seems to me, and that should be calculable.
> Any help will be appreciated.
> Yours,
> Tadeusz Piotrowski
> Professor in Linguistics
> Opole University
> Poland
