[Corpora-List] threshold for hapax legomena

Adam Kilgarriff adam at lexmasterclass.com
Fri May 25 08:07:57 UTC 2007


Tadeusz,

It all depends on what you want to do.  I like the title of a paper by
Daelemans, Van den Bosch and Zavrel: "Forgetting exceptions is harmful in
language learning" - the forgetting in question is the forgetting of rare
occurrences (e.g. hapaxes, in your terminology) for machine learning.  For
many ML purposes, *any* threshold is a bad idea, as a range of research by
that team and others (e.g. Bod) demonstrates.  (Thresholds are appealing as
they reduce the computational scale of the task, so lots of people have
tried to use them to make tasks more manageable; however, performance has
usually suffered.)

For human users, it depends mainly on how much data the user wants to see.
In the Sketch Engine we are about to revise the default threshold for which
collocations to show to the user.  Our users generally want a simple report
that they can view without scrolling.  We will revise the default threshold
so that it is a function of corpus size and also of headword frequency, so
that the user who only wants to see one HTML page (with no scrolling) gets a
maximally informative overview of the headword's collocational behaviour.
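
To make the "one page, no scrolling" idea concrete, here is a minimal
sketch.  It is not how the Sketch Engine computes its default; the function
name, the max_rows parameter and the sample counts are all hypothetical.
It simply picks the lowest frequency cut-off at which the collocates still
fit on one page, so the cut-off rises automatically for frequent headwords
and large corpora.

```python
def page_fitting_threshold(collocate_counts, max_rows=20):
    """Return the smallest minimum frequency t such that at most
    max_rows collocates have a count >= t.

    collocate_counts maps each collocate to its co-occurrence count
    with the headword (counts are assumed to be >= 1).
    """
    counts = sorted(collocate_counts.values(), reverse=True)
    if len(counts) <= max_rows:
        # Everything fits on the page, so no filtering is needed.
        return 1
    # counts[max_rows] is the first count that would not fit;
    # the threshold has to exceed it.
    return counts[max_rows] + 1


def visible_collocates(collocate_counts, max_rows=20):
    """Return the collocates that survive the page-fitting threshold."""
    t = page_fitting_threshold(collocate_counts, max_rows)
    return {w: c for w, c in collocate_counts.items() if c >= t}


# Hypothetical counts for one headword.
sample = {"strong": 10, "black": 7, "hot": 7, "weak": 3, "iced": 1}
print(page_fitting_threshold(sample, max_rows=3))
print(visible_collocates(sample, max_rows=3))
```

Note that ties at the cut-off are dropped together, so the report can come
out shorter than max_rows but never longer.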

Best

Adam
http://www.kilgarriff.co.uk 
http://www.sketchengine.co.uk 


-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of TadPiotr
Sent: 25 May 2007 08:24
To: corpora at hd.uib.no
Subject: [Corpora-List] threshold for hapax legomena

Dear All,
Is there any reliable way of calculating the threshold for hapax legomena in
a corpus of a given size? By a threshold I mean the number of occurrences
(tokens) below which any types will be treated as statistically
insignificant. There is a common belief that types with fewer than 10
occurrences count as hapax legomena, but it seems to me that the cutoff
should vary with the size of the corpus, and that it should be calculable.
Any help will be appreciated.
Yours,
Tadeusz Piotrowski
Professor in Linguistics
Opole University
Poland
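
Whatever cut-off is considered, it helps to see first how many types it
would discard in a given corpus.  The following is a minimal sketch that
assumes the corpus is already tokenised into a list of strings; the function
names and the toy sentence are illustrative only, and the code does not
itself say which cut-off is statistically justified.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency f to the number of types occurring exactly f times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())


def share_of_types_below(tokens, threshold):
    """Fraction of types with fewer than `threshold` occurrences."""
    type_counts = Counter(tokens)
    if not type_counts:
        return 0.0
    rare = sum(1 for c in type_counts.values() if c < threshold)
    return rare / len(type_counts)


# Toy corpus; a real one would be read from files and tokenised properly.
tokens = "the cat sat on the mat the end".split()
spectrum = frequency_spectrum(tokens)
print(spectrum[1])                       # number of types occurring once
print(share_of_types_below(tokens, 10))  # share of types under a cut-off of 10
```

Running the same two functions on samples of different sizes from one corpus
shows directly how the share of types under a fixed cut-off changes with
corpus size.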


