[Corpora-List] threshold for hapax legomena

Ramesh Krishnamurthy r.krishnamurthy at aston.ac.uk
Fri May 25 09:02:54 UTC 2007


Hi Tadeusz

Although most corpus software programs offer thresholds, and most corpus
users apply them, one significant problem arises if you are working with
'words': the individual forms of a lemma may each lie below the threshold
but, added together, may become a lemma above the threshold; similarly,
when manually investigated, various lexical sets (eg semantic fields) can
be discovered below the threshold...
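
To make the word-form point concrete, here is a minimal Python sketch. The
counts, the form-to-lemma mapping and the cut-off of 10 are all invented for
illustration (they come from no real corpus), and a real project would use a
proper lemmatiser rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical token counts for individual word forms (invented figures).
form_counts = Counter(
    {"glimmer": 4, "glimmers": 3, "glimmered": 2, "glimmering": 4}
)

# Hypothetical form-to-lemma mapping, written by hand for this example.
lemma_of = {form: "glimmer" for form in form_counts}

THRESHOLD = 10  # illustrative cut-off, not a recommended value


def items_at_or_above(counts, threshold):
    """Return the set of keys whose count reaches the threshold."""
    return {key for key, n in counts.items() if n >= threshold}


def lemma_counts(counts, mapping):
    """Sum word-form counts into lemma counts (unmapped forms stay as-is)."""
    totals = Counter()
    for form, n in counts.items():
        totals[mapping.get(form, form)] += n
    return totals


forms_kept = items_at_or_above(form_counts, THRESHOLD)
lemmas_kept = items_at_or_above(lemma_counts(form_counts, lemma_of), THRESHOLD)

print("forms kept: ", forms_kept)
print("lemmas kept:", lemmas_kept)
```

With these made-up figures no single form reaches the cut-off, but the
summed lemma does, so the result depends on whether the threshold is
applied before or after grouping.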

a) So the threshold partly depends on the purpose of the corpus search.

b) You state '10 occurrences' without indicating the size of the corpus.
Surely any concept of threshold must be relative to corpus size?
10 occurrences in a corpus of 1m words would be rather high,
whereas in a corpus of 500m words, it would be rather low.

c) For approximate figures (from memory), at Cobuild I think we used
a threshold of 3 occurrences in the 7.3m word corpus, 10 occurrences
in the 167m corpus, and 30 in the 323m corpus. But lexicographers
could always argue the case with their editors for 'below threshold'
items to be included in the draft dictionary (or indeed for 'just above
threshold' items to be excluded, eg if they were very restricted in
distribution, eg to one or two source texts).

d) I remember 'panhandle' being a problem in c. 1994: no individual form,
or wordclass, or meaning, was frequent enough, but adding them all
together raised it above the 'threshold'. The problem then was that the
evidence for each dictionary element was minimal, which caused problems
in writing definitions, assessing grammar "patterns" (how many
occurrences do you need to constitute a reliable pattern?), assigning
style labels, etc.
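
For points b) and c), one way to compare thresholds across corpora is to
convert each to a rate per million words. The sketch below does only that
arithmetic. The three Cobuild pairs are the approximate, from-memory figures
given in c); the function name per_million is just a label chosen here.

```python
def per_million(occurrences, corpus_tokens):
    """Convert a raw occurrence count to a rate per million tokens."""
    return occurrences / corpus_tokens * 1_000_000


# (threshold, corpus size in words), approximate figures from point c).
cobuild = [(3, 7_300_000), (10, 167_000_000), (30, 323_000_000)]
cobuild_rates = [round(per_million(n, size), 3) for n, size in cobuild]

# The same raw threshold of 10 in the two corpus sizes mentioned in point b).
ten_in_1m = per_million(10, 1_000_000)
ten_in_500m = per_million(10, 500_000_000)

for (n, size), rate in zip(cobuild, cobuild_rates):
    print(f"{n:>2} occurrences in {size:>11,} words = {rate} per million")
print(f"10 in 1m words   = {ten_in_1m} per million")
print(f"10 in 500m words = {ten_in_500m} per million")
```

On these figures the thresholds work out at roughly 0.411, 0.06 and 0.093
per million words, so they were not a fixed proportion of corpus size, and a
raw cut-off of 10 is 500 times stricter in a 1m word corpus than in a 500m
word one.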

Best
Ramesh

At 08:23 25/05/2007, TadPiotr wrote:
>Dear All,
>Is there any reliable way of calculating the threshold for hapax legomena in
>a corpus of a given size? By a threshold I mean the number of occurrences
>(tokens) below which any types will be treated as statistically
>insignificant. There is a common belief that below 10 occurrences what we
>have is hapax legomena, but that should differ with respect to the size of
>the corpus, it seems to me, and that should be calculable.
>Any help will be appreciated.
>Yours,
>Tadeusz Piotrowski
>Professor in Linguistics
>Opole University
>Poland

Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
[Room NX08, 10th Floor, North Wing of Main Building]
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/ 


