[Corpora-List] threshold for hapax legomena
TadPiotr
tadpiotr at plusnet.pl
Fri May 25 08:01:21 UTC 2007
Sorry, I was using the term hapax legomena in a loose sense, to mean rare
items, which I should not have done.
Tadeusz Piotrowski
> -----Original Message-----
> From: Eric Atwell [mailto:eric at comp.leeds.ac.uk]
> Sent: Friday, May 25, 2007 9:54 AM
> To: TadPiotr
> Cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] threshold for hapax legomena
>
> I like the definition given in Wikipedia:
> http://en.wikipedia.org/wiki/Hapax_legomena
>
> "A hapax legomenon (pl. hapax legomena, though sometimes
> called hapaxes for short) is a word which occurs only once in
> the written record of a language, in the works of an author,
> or in a single text. If a word is used twice it is a dis
> legomenon, thrice, a tris legomenon. Beyond tetrakis
> legomenon (four times), a word is not rare enough to note."
>
> So words with "below 10 occurrences" may not be "important/worthwhile"
> but they are not hapax legomena.
>
> Furthermore, I don't see how there can be a single
> statistical metric which tells you whether a cut-off of 10
> (or other cut-off) is sensible regardless of application.
> For some applications, you clearly only want
> medium-to-high-frequency words (e.g. comparing national
> varieties of English); for others you may want to include
> even very rare words (e.g. building a comprehensive lexicon for
> a PoS-tagger).
>
> Eric Atwell, University of Leeds
>
> On Fri, 25 May 2007, TadPiotr wrote:
>
> > Dear All,
> > Is there any reliable way of calculating the threshold for hapax
> > legomena in a corpus of a given size? By a threshold I mean the number
> > of occurrences (tokens) below which any types will be treated as
> > statistically insignificant. There is a common belief that below 10
> > occurrences what we have is hapax legomena, but that should differ with
> > respect to the size of the corpus, it seems to me, and that should be
> > calculable.
> > Any help will be appreciated.
> > Yours,
> > Tadeusz Piotrowski
> > Professor in Linguistics
> > Opole University
> > Poland
> >
> >
>
>
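To make the distinction in this thread concrete, here is a minimal Python sketch of a frequency spectrum: for each frequency k, the number of word types that occur exactly k times. Under the strict definition quoted above, hapax legomena are the types with k = 1, whereas "rare items below a cut-off" are all types with k below the chosen threshold. The function names (frequency_spectrum, types_below), the crude regex tokenizer, and the sample sentence are illustrative assumptions, not anything proposed by the posters.

```python
from collections import Counter
import re


def frequency_spectrum(text):
    """Map each frequency k to the number of word types occurring exactly k times."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    type_counts = Counter(tokens)          # type -> number of tokens
    return Counter(type_counts.values())   # frequency k -> number of types


def types_below(spectrum, cutoff):
    """Count the types whose frequency is strictly below the cut-off."""
    return sum(n for k, n in spectrum.items() if k < cutoff)


sample = "the cat sat on the mat and the dog sat on the log"
spectrum = frequency_spectrum(sample)

print("hapax legomena (k = 1):", spectrum[1])
print("dis legomena (k = 2):", spectrum[2])
print("types below a cut-off of 10:", types_below(spectrum, 10))
```

Running the same two functions on corpora of different sizes shows how the share of types falling below a fixed cut-off changes with corpus size. Whether any particular cut-off is sensible still depends on the application.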