[Corpora-List] threshold for hapax legomena
Eric Atwell
eric at comp.leeds.ac.uk
Fri May 25 07:54:21 UTC 2007
I like the definition given in Wikipedia:
http://en.wikipedia.org/wiki/Hapax_legomena
"A hapax legomenon (pl. hapax legomena, though sometimes called hapaxes
for short) is a word which occurs only once in the written record of a
language, in the works of an author, or in a single text. If a word is
used twice it is a dis legomenon, thrice, a tris legomenon. Beyond
tetrakis legomenon (four times), a word is not rare enough to note."
So words with "below 10 occurrences" may not be "important/worthwhile"
but they are not hapax legomena.
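To make the distinction concrete, here is a rough Python sketch that groups
word types by exact frequency, following the Wikipedia definition quoted
above. The function name frequency_classes, the crude regex tokeniser and the
sample sentence are all my own assumptions for illustration; a real corpus
would need proper tokenisation.

```python
import re
from collections import Counter

# Names for the frequency classes in the quoted definition (1 to 4 occurrences).
CLASS_NAMES = {1: "hapax", 2: "dis", 3: "tris", 4: "tetrakis"}


def frequency_classes(text):
    """Group word types by exact frequency, for frequencies 1 to 4.

    Tokenisation is a deliberately crude assumption: lowercase the text
    and keep runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    classes = {name: [] for name in CLASS_NAMES.values()}
    for word, n in counts.items():
        if n in CLASS_NAMES:
            classes[CLASS_NAMES[n]].append(word)
    # Sort each class so the output is stable.
    return {name: sorted(words) for name, words in classes.items()}


# Hypothetical toy "corpus", only to show the shape of the result.
sample = "the cat saw the dog and the dog saw a bird"
result = frequency_classes(sample)
for name, words in result.items():
    print(name, "legomena:", words)
```

On this reading only the types in the first class are hapax legomena; a type
seen, say, nine times falls outside all four classes.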
Furthermore, I don't see how there can be a single statistical metric
which tells you whether a cut-off of 10 (or any other cut-off) is sensible
regardless of application. For some applications, you clearly only want
medium-to-high-frequency words (e.g. comparing national varieties of
English); for others you may want to include even very rare words
(e.g. building a comprehensive lexicon for a PoS-tagger).
Eric Atwell, University of Leeds
On Fri, 25 May 2007, TadPiotr wrote:
> Dear All,
> Is there any reliable way of calculating the threshold for hapax legomena in
> a corpus of a given size? By a threshold I mean the number of occurrences
> (tokens) below which any types will be treated as statistically
> insignificant. There is a common belief that below 10 occurrences what we
> have is hapax legomena, but that should differ with respect to the size of
> the corpus, it seems to me, and that should be calculable.
> Any help will be appreciated.
> Yours,
> Tadeusz Piotrowski
> Professor in Linguistics
> Opole University
> Poland
>
>