[Corpora-List] threshold for hapax legomena

Oliver Mason O.Mason at bham.ac.uk
Fri May 25 08:43:31 UTC 2007


I would agree that all thresholds are somewhat arbitrary and mainly
serve the two purposes mentioned: reducing computational load and
presentation to users.

In my own work I have recently used thresholds of two kinds for these purposes:

- a percentage filter, where I truncate a list at the point where an
item's frequency falls below a certain percentage of the frequency of
the most common item.  For example, if in a list of multi-word units
the most frequent one has a frequency of 1200, then the 10% filter
would cut off at frequency 120, the 5% one at 60, and the 1% one at
12.  The choice between the three filters is of course arbitrary, a
bit like the choice of p-value in statistical significance testing, I
guess.  (A small sketch of this filter follows after the list.)

- for computing collocations I use THRESHOLD =
\sqrt{\frac{N}{1,000,000}}, i.e. the square root of the corpus size N
divided by 1 million.  This yields a 'dynamic' value which reflects
the fact that smaller corpora need lower thresholds than larger ones.
For a 1-million-word corpus the threshold would be 1, for 100 million
words it would be 10.  (This, too, is sketched after the list.)
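
A minimal Python sketch of the percentage filter, assuming the input
is a list of (unit, frequency) pairs sorted by descending frequency;
the function and example data are illustrative only, not taken from
any particular tool:

    def percentage_filter(items, percentage):
        """Keep only items whose frequency is at least the given
        percentage of the most frequent item's frequency."""
        if not items:
            return []
        cutoff = items[0][1] * percentage / 100.0
        return [(unit, freq) for unit, freq in items if freq >= cutoff]

    # Made-up frequencies, keeping the maximum of 1200 from the example
    # above: the 10% filter cuts off at 120 and drops the last pair.
    mwus = [("in terms of", 1200), ("on the other hand", 480),
            ("as well as", 130), ("of course", 95)]
    print(percentage_filter(mwus, 10))
    # [('in terms of', 1200), ('on the other hand', 480), ('as well as', 130)]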
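
And a similarly minimal sketch of the dynamic collocation threshold,
sqrt(N / 1,000,000), where N is the corpus size in words; again the
function name is just for illustration:

    import math

    def collocation_threshold(corpus_size):
        """Frequency threshold growing with the square root of the
        corpus size: 1 for 1 million words, 10 for 100 million."""
        return math.sqrt(corpus_size / 1_000_000)

    for n in (1_000_000, 10_000_000, 100_000_000):
        print(n, collocation_threshold(n))
    # 1000000 1.0
    # 10000000 3.1622776601683795
    # 100000000 10.0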

Both of these I use purely for practical reasons, and the exact values
are indeed more or less arbitrarily chosen, though motivated by
general observations of how language behaves.

Oliver


