[Corpora-List] threshold for hapax legomena

TadPiotr tadpiotr at plusnet.pl
Fri May 25 07:23:45 UTC 2007


Dear All,
Is there any reliable way of calculating the threshold for hapax legomena in
a corpus of a given size? By a threshold I mean the number of occurrences
(tokens) below which a type is treated as statistically insignificant. There
is a common belief that types with fewer than 10 occurrences count as hapax
legomena, but it seems to me that this cutoff should vary with the size of
the corpus, and that it should be calculable.
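
To make the question concrete, a minimal Python sketch follows. It is only an
illustration under stated assumptions: the function names
`types_below_threshold` and `per_million` and the toy sentence are
hypothetical, and normalising to occurrences per million tokens is just one
possible way of relating a raw cutoff to corpus size, not an answer to the
question.

```python
from collections import Counter


def types_below_threshold(tokens, threshold):
    """Return the set of types whose token count is below `threshold`."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count < threshold}


def per_million(count, corpus_size):
    """Convert a raw token count into occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


# Toy corpus (hypothetical data, 13 tokens).
tokens = "the cat sat on the mat and the dog sat on the log".split()

# Types occurring fewer than 2 times, i.e. exactly once.
rare_types = types_below_threshold(tokens, 2)

# The same raw cutoff of 10 occurrences in corpora of different sizes.
small = per_million(10, 1_000_000)     # 1-million-token corpus
large = per_million(10, 100_000_000)   # 100-million-token corpus
```

With a fixed raw cutoff of 10, the same count corresponds to 10 occurrences
per million in a 1-million-token corpus but only 0.1 per million in a
100-million-token corpus, which is why a fixed cutoff looks questionable.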
Any help will be appreciated.
Yours,
Tadeusz Piotrowski
Professor in Linguistics
Opole University
Poland


