[Corpora-List] threshold for hapax legomena

TadPiotr tadpiotr at plusnet.pl
Fri May 25 08:01:21 UTC 2007


Sorry, I was using the term hapax legomena in the loose sense of "rare
items", which I should not have done.
Tadeusz Piotrowski 

> -----Original Message-----
> From: Eric Atwell [mailto:eric at comp.leeds.ac.uk] 
> Sent: Friday, May 25, 2007 9:54 AM
> To: TadPiotr
> Cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] threshold for hapax legomena
> 
> I like the definition given in Wikipedia:
> http://en.wikipedia.org/wiki/Hapax_legomena
> 
> "A hapax legomenon (pl. hapax legomena, though sometimes 
> called hapaxes for short) is a word which occurs only once in 
> the written record of a language, in the works of an author, 
> or in a single text. If a word is used twice it is a dis 
> legomenon, thrice, a tris legomenon. Beyond tetrakis 
> legomenon (four times), a word is not rare enough to note."
> 
> So words with "below 10 occurrences" may not be
> "important/worthwhile", but they are not hapax legomena.
> 
> Furthermore, I don't see how there can be a single
> statistical metric which tells you whether a cut-off of 10
> (or other cut-off) is sensible regardless of application.
> For some applications, you clearly only want
> medium-to-high-frequency words (e.g. comparing national
> varieties of English); for others you may want to include
> even very rare words (e.g. building a comprehensive lexicon for
> a PoS-tagger).
> 
> Eric Atwell, University of Leeds
> 
> On Fri, 25 May 2007, TadPiotr wrote:
> 
> > Dear All,
> > Is there any reliable way of calculating the threshold for hapax
> > legomena in a corpus of a given size? By a threshold I mean the number
> > of occurrences (tokens) below which any types will be treated as
> > statistically insignificant. There is a common belief that below 10
> > occurrences what we have is hapax legomena, but that should differ with
> > respect to the size of the corpus, it seems to me, and that should be
> > calculable. Any help will be appreciated.
> > Yours,
> > Tadeusz Piotrowski
> > Professor in Linguistics
> > Opole University
> > Poland
> >
> >
> 
> 
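The distinction drawn in this thread, between hapax legomena proper (types occurring exactly once) and "rare items" below some application-specific cut-off, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and does not come from the posters: the function names, the crude regular-expression tokenizer, the toy sample sentence, and the cut-off value of 3. It does not answer the question of how a threshold should scale with corpus size; it only shows how the two notions differ in practice.

```python
from collections import Counter
import re


def count_types(text):
    """Return a Counter mapping each word type to its token count.

    Tokenization is a deliberately crude assumption: lowercase the text
    and keep runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def frequency_spectrum(type_counts):
    """Map each frequency f to the number of types occurring exactly f times."""
    return Counter(type_counts.values())


def types_with_frequency(type_counts, f):
    """Types occurring exactly f times (f=1: hapax, f=2: dis legomena, ...)."""
    return sorted(w for w, c in type_counts.items() if c == f)


def types_below(type_counts, cutoff):
    """Types occurring fewer than `cutoff` times: the loose 'rare items' sense."""
    return sorted(w for w, c in type_counts.items() if c < cutoff)


# Hypothetical toy "corpus", used only to exercise the functions.
sample = "the cat sat on the mat and the dog sat on the log"

type_counts = count_types(sample)
spectrum = frequency_spectrum(type_counts)
hapaxes = types_with_frequency(type_counts, 1)
dis_legomena = types_with_frequency(type_counts, 2)
rare_items = types_below(type_counts, 3)  # cut-off of 3 is arbitrary

print("frequency spectrum:", dict(spectrum))
print("hapax legomena:", hapaxes)
print("dis legomena:", dis_legomena)
print("rare items (below cut-off):", rare_items)
```

On a real corpus the cut-off passed to `types_below` would be chosen per application, as argued above, while the hapax count is fixed by definition at a frequency of one.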


