[Corpora-List] threshold for hapax legomena

Przemek Kaszubski przemka at amu.edu.pl
Fri May 25 16:20:13 UTC 2007


Guy Aston's words below are somewhat intuition-based, but I often lend 
credence to his intuitions:

"...advanced learners [...] should consider any word occurring more than 
once every million words of running text (i.e. more than 100 times in 
the 100-million word BNC) to be “worth learning”. This is a very rough 
rule of thumb: for instance, if a word only occurs in a particular type 
of texts, it may be less - or more - important, depending on whether 
that type of texts is important to them. For multi-word phrases, the 
“worth learning” threshold, as we shall see, needs to be set lower." (p. 
133)

Aston, Guy. 2002. "Getting one's teeth into a corpus", in: Melinda Tan 
(ed.), Corpus studies in language education, IELE Press. 131-143. 
Pre-pub or web edition: http://sslmit.unibo.it/~guy/tan.htm
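
To make the arithmetic in the quoted rule of thumb concrete, here is a 
minimal sketch of how a per-million rate might be scaled to a raw count 
for a corpus of a given size. The function names and the `per_million` 
parameter are my own illustrative assumptions, not anything from Aston's 
chapter, and this is only the pedagogical heuristic, not a test of 
statistical significance.

```python
from collections import Counter

def count_threshold(corpus_size, per_million=1.0):
    """Raw count equivalent to `per_million` occurrences per million tokens.

    Hypothetical helper: it only rescales the rate to the corpus size.
    """
    return per_million * corpus_size / 1_000_000

def worth_learning(tokens, per_million=1.0):
    """Return the types whose raw count exceeds the scaled threshold.

    `tokens` is assumed to be an already tokenised list of word forms.
    """
    threshold = count_threshold(len(tokens), per_million)
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n > threshold}

# The 100-million-word BNC case from the quotation:
print(count_threshold(100_000_000))  # 100.0
```

On this reading the cut-off grows linearly with corpus size, so a 
1-million-word corpus would give a threshold of 1 occurrence and a 
10-million-word corpus a threshold of 10.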

Regards,

Przemek


TadPiotr wrote (2007-05-25 09:23):
> Dear All,
> Is there any reliable way of calculating the threshold for hapax legomena in
> a corpus of a given size? By a threshold I mean the number of occurrences
> (tokens) below which any types will be treated as statistically
> insignificant. There is a common belief that below 10 occurrences what we
> have is hapax legomena, but that should differ with respect to the size of
> the corpus, it seems to me, and that should be calculable.
> Any help will be appreciated.
> Yours,
> Tadeusz Piotrowski
> Professor in Linguistics
> Opole University
> Poland

-- 
Dr Przemyslaw Kaszubski
+48 61 8293515

PICLE EAP LEARNER CORPUS ONLINE:
http://www.staff.amu.edu.pl/~przemka/picle.html

CORPUS LINGUISTICS BIBLIOGRAPHY:
http://www.staff.amu.edu.pl/~przemka

MY CORPUS LINGUISTICS SEMINARS:
http://www.staff.amu.edu.pl/~przemka/seminars.htm

EAP WRITING PAGE (IFA FULL-TIME PROGRAMME):
http://www.staff.amu.edu.pl/~przemka/IFA_writing

=======================================
School of English (IFA)
Adam Mickiewicz University
http://ifa.amu.edu.pl
=======================================


