[Corpora-List] threshold for hapax legomena
Przemek Kaszubski
przemka at amu.edu.pl
Fri May 25 16:20:13 UTC 2007
Guy Aston's words below are somewhat intuition-based, but I often lend
credence to his intuitions:
"...advanced learners [...] should consider any word occurring more than
once every million words of running text (i.e. more than 100 times in
the 100-million word BNC) to be “worth learning”. This is a very rough
rule of thumb: for instance, if a word only occurs in a particular type
of texts, it may be less - or more - important, depending on whether
that type of texts is important to them. For multi-word phrases, the
“worth learning” threshold, as we shall see, needs to be set lower." (p.
133)
Aston, Guy. 2002. "Getting one's teeth into a corpus", in: Melinda Tan
(ed.), Corpus studies in language education, IELE Press. 131-143.
Pre-pub or web edition: http://sslmit.unibo.it/~guy/tan.htm
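
To show how this rule of thumb scales with corpus size, here is a minimal Python sketch. The function names are hypothetical (mine, not Aston's), and the default rate of 1 occurrence per million words is only his rough heuristic, not a test of statistical significance.

```python
# Sketch of a frequency-per-million threshold, assuming Aston's rough
# "more than once per million words" heuristic. All names are illustrative.

def per_million(count, corpus_size):
    """Normalise a raw token count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def min_count_for_rate(corpus_size, rate_per_million=1.0):
    """Raw count that corresponds to the given per-million rate."""
    return rate_per_million * corpus_size / 1_000_000


def worth_learning(count, corpus_size, rate_per_million=1.0):
    """True if the word occurs MORE than rate_per_million times per million."""
    return per_million(count, corpus_size) > rate_per_million


if __name__ == "__main__":
    bnc_size = 100_000_000  # the 100-million-word BNC mentioned in the quote
    print(min_count_for_rate(bnc_size))   # raw-count cut-off for the BNC
    print(worth_learning(101, bnc_size))  # just above the cut-off
    print(worth_learning(100, bnc_size))  # exactly at the cut-off, not above
```

For a corpus of one million words the same rate corresponds to a raw count of 1, so a fixed cut-off such as 10 occurrences means something quite different in corpora of different sizes.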
Regards,
Przemek
TadPiotr wrote (2007-05-25 09:23):
> Dear All,
> Is there any reliable way of calculating the threshold for hapax legomena in
> a corpus of a given size? By a threshold I mean the number of occurrences
> (tokens) below which any types will be treated as statistically
> insignificant. There is a common belief that below 10 occurrences what we
> have is hapax legomena, but that should differ with respect to the size of
> the corpus, it seems to me, and that should be calculable.
> Any help will be appreciated.
> Yours,
> Tadeusz Piotrowski
> Professor in Linguistics
> Opole University
> Poland
--
Dr Przemyslaw Kaszubski
+48 61 8293515
PICLE EAP LEARNER CORPUS ONLINE:
http://www.staff.amu.edu.pl/~przemka/picle.html
CORPUS LINGUISTICS BIBLIOGRAPHY:
http://www.staff.amu.edu.pl/~przemka
MY CORPUS LINGUISTICS SEMINARS:
http://www.staff.amu.edu.pl/~przemka/seminars.htm
EAP WRITING PAGE (IFA FULL-TIME PROGRAMME):
http://www.staff.amu.edu.pl/~przemka/IFA_writing
=======================================
School of English (IFA)
Adam Mickiewicz University
http://ifa.amu.edu.pl
=======================================