[Corpora-List] threshold for hapax legomena
Ramesh Krishnamurthy
r.krishnamurthy at aston.ac.uk
Fri May 25 09:02:54 UTC 2007
Hi Tadeusz
Although most corpus software programs offer thresholds, and most corpus users
apply them, a significant problem arises if you are working with 'words':
the individual forms of a lemma may each lie below the threshold but, added
together, amount to a lemma above the threshold. Similarly, manual
investigation can reveal various lexical sets (eg semantic fields) below
the threshold...
a) So the threshold partly depends on the purpose of the corpus search.
b) You state '10 occurrences' without indicating the size of the corpus.
Surely any concept of threshold must be relative to corpus size?
10 occurrences in a corpus of 1m words would be rather high,
whereas in a corpus of 500m words, it would be rather low.
c) For approximate figures (from memory), at Cobuild I think we used
a threshold of 3 occurrences in the 7.3m word corpus, 10 occurrences
in the 167m corpus, and 30 in the 323m corpus. But lexicographers
could always argue the case with their editors for 'below threshold'
items to be included in the draft dictionary (or indeed for 'just above
threshold' items to be excluded, eg if they were very restricted in
distribution, eg to one or two source texts).
d) I remember 'panhandle' being a problem in c. 1994: no individual
form, word class, or meaning was frequent enough, but adding them all
together raised it above the 'threshold'. The problem then was that the
evidence for each dictionary element was minimal, which caused problems
in writing definitions, assessing grammar "patterns" (how many occurrences
do you need to constitute a reliable pattern?), style labels, etc.
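
To make points b) to d) concrete, here is a minimal Python sketch. It is not anything Cobuild actually ran: the function names `per_million` and `lemma_totals` are hypothetical, the thresholds are the approximate from-memory figures in c), and the 'panhandle' form counts are invented purely for illustration. It expresses each threshold as a rate per million words, then shows forms that each fall below a threshold of 10 but exceed it once summed under their lemma.

```python
# Approximate thresholds from point c): (occurrences, corpus size in words).
COBUILD_THRESHOLDS = [
    (3, 7_300_000),
    (10, 167_000_000),
    (30, 323_000_000),
]


def per_million(occurrences, corpus_size):
    """Express a raw count as occurrences per million words."""
    return occurrences * 1_000_000 / corpus_size


def lemma_totals(form_counts, form_to_lemma):
    """Sum word-form counts under their lemma (unmapped forms stand alone)."""
    totals = {}
    for form, count in form_counts.items():
        lemma = form_to_lemma.get(form, form)
        totals[lemma] = totals.get(lemma, 0) + count
    return totals


# The same raw threshold means very different rates in different corpora.
rates = [per_million(n, size) for n, size in COBUILD_THRESHOLDS]
for (n, size), rate in zip(COBUILD_THRESHOLDS, rates):
    print(f"{n} occurrences in {size:,} words = {rate:.3f} per million")

# Invented counts for illustration only (not real corpus data).
form_counts = {"panhandle": 4, "panhandles": 2, "panhandled": 1, "panhandling": 4}
form_to_lemma = {form: "panhandle" for form in form_counts}
threshold = 10

totals = lemma_totals(form_counts, form_to_lemma)
below = [form for form, count in form_counts.items() if count < threshold]
print(f"forms below threshold: {below}")
print(f"lemma total: {totals['panhandle']} (threshold {threshold})")
```

On these figures the three thresholds are not a constant rate: they work out at roughly 0.41, 0.06 and 0.09 occurrences per million words respectively, so the threshold grew much more slowly than the corpus did. The sketch also says nothing about distribution across source texts, which would need per-text counts.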
Best
Ramesh
At 08:23 25/05/2007, TadPiotr wrote:
>Dear All,
>Is there any reliable way of calculating the threshold for hapax legomena in
>a corpus of a given size? By a threshold I mean the number of occurrences
>(tokens) below which any types will be treated as statistically
>insignificant. There is a common belief that below 10 occurrences what we
>have is hapax legomena, but that should differ with respect to the size of
>the corpus, it seems to me, and that should be calculable.
>Any help will be appreciated.
>Yours,
>Tadeusz Piotrowski
>Professor in Linguistics
>Opole University
>Poland
Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
Floor, North Wing of Main Building]
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/