[Corpora-List] threshold for hapax legomena

Khurshid Ahmad kahmad at cs.tcd.ie
Fri May 25 11:24:21 UTC 2007


Dear Tadeusz
I am not sure there is such a thing as being under- or over-pedantic: a
pedant is just that, pedantic.

Prof Sampson is correct, and I think we should try to stick to a single
definition.  As Wikipedia points out, a nonce word is not recorded and
hence is not a hapax; hapax legomena have been used as a signature in the
verification of New Testament texts (Wikipedia again).

If your question is how to define a threshold frequency for rarity, then
this will have to be a relative measure.  In my own work I have used two
z-scores to set a rarity threshold: one for frequency, and one for a
normalised measure related to tf-idf (term frequency/inverse document
frequency), which I call weirdness.  If both z-scores are greater than
zero, the token can be considered a candidate term.  This joint condition
helps to ignore low-frequency tokens (usually misspellings of accepted
tokens) that occur in very few texts in a corpus.
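
A minimal sketch of the joint condition, in Python.  It assumes that
weirdness is the ratio of a token's relative frequency in the corpus under
study to its relative frequency in a general reference corpus; the add-one
smoothing, the function names and the toy data are illustrative
assumptions and not part of the method as described above.

```python
from collections import Counter
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of every value in a {token: number} mapping."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)
    if sigma == 0:
        return {tok: 0.0 for tok in values}
    return {tok: (v - mu) / sigma for tok, v in values.items()}


def candidate_terms(special_tokens, general_tokens):
    """Tokens whose frequency z-score and weirdness z-score are both > 0."""
    spec = Counter(special_tokens)
    gen = Counter(general_tokens)
    n_spec = len(special_tokens)
    n_gen = len(general_tokens)
    # Assumed weirdness: relative frequency in the special corpus divided by
    # relative frequency in the general corpus (add-one smoothing so that
    # tokens unseen in the general corpus do not divide by zero).
    weird = {
        tok: (f / n_spec) / ((gen[tok] + 1) / (n_gen + 1))
        for tok, f in spec.items()
    }
    z_freq = z_scores(spec)
    z_weird = z_scores(weird)
    return sorted(t for t in spec if z_freq[t] > 0 and z_weird[t] > 0)


# Toy data: "teh" is a rare misspelling, "the" is frequent everywhere.
special = ["hapax"] * 6 + ["the"] * 6 + ["corpus"] + ["teh"]
general = ["the"] * 50 + ["corpus"] * 2 + ["of"] * 48
print(candidate_terms(special, general))
```

On this toy data "the" fails the weirdness test and "teh" fails the
frequency test, so only "hapax" satisfies both conditions.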

Oliver's point about compound tokens is important, and here Frank
Smadja's selection restraint can be very helpful.  Frank uses the
collocation frequency of a token with all other tokens in a window of five
near neighbours.  He then computes a histogram and has a statistic to
determine which of the bins in the histogram are significant.  This
statistic and the z-score of frequency then help you to select compound
terms.
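
A rough sketch of the positional histogram idea, in Python.  It counts how
often each collocate occurs at each offset from -5 to +5 around a target
token, then marks a bin as significant when its count exceeds the mean of
the bins by k standard deviations.  The test and the value k=1 are
assumptions made for illustration; they are not claimed to be Frank's
exact statistic.

```python
from collections import defaultdict
from statistics import mean, pvariance

WINDOW = 5  # five near neighbours on either side of the target


def position_histograms(tokens, target, window=WINDOW):
    """For each collocate of target, count occurrences at each offset."""
    offsets = [d for d in range(-window, window + 1) if d != 0]
    hists = defaultdict(lambda: dict.fromkeys(offsets, 0))
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for d in offsets:
            j = i + d
            if 0 <= j < len(tokens):
                hists[tokens[j]][d] += 1
    return dict(hists)


def significant_bins(hist, k=1.0):
    """Offsets whose count exceeds mean + k * standard deviation (assumed)."""
    counts = list(hist.values())
    mu = mean(counts)
    sd = pvariance(counts) ** 0.5
    if sd == 0:
        return []
    return [d for d, c in hist.items() if c > mu + k * sd]


tokens = ("stock market fell sharply today after the news . " * 3).split()
hists = position_histograms(tokens, "stock")
print(significant_bins(hists["market"]))
```

Here "market" always occurs immediately to the right of "stock", so the
only significant bin is offset +1, which is the kind of peak that suggests
a compound term.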

Then of course there is the lexicographer's intuition, and my friend
Ramesh always has the right intuition (see his notes on the construction
of Cobuild below), which refreshes the memory of the boss (J. Sinclair).

Best wishes
> Hi Tadeusz
>
> Although most corpus software programs offer thresholds, and most corpus
> users apply them, one of the significant problems arises if you are
> working with 'words': individual lemma forms may lie below the threshold,
> but added together, may become a lemma above the threshold; similarly,
> when manually investigated, various lexical sets (eg semantic fields) can
> be discovered below the threshold...
>
> a) So threshold partly depends on the purpose of the corpus search.
>
> b) You state '10 occurrences' without indicating size of corpus.
> Surely any concept of threshold must be relative to corpus size?
> 10 occurrences in a corpus of 1m words would be rather high,
> whereas in a corpus of 500m words, it would be rather low.
>
> c) For approximate figures (from memory), at Cobuild I think we used
> a threshold of 3 occurrences in the 7.3m word corpus, 10 occurrences
> in the 167m corpus, and 30 in the 323m corpus. But lexicographers
> could always argue the case with their editors for 'below threshold'
> items to be included in the draft dictionary (or indeed for 'just above
> threshold' items to be excluded, eg if they were very restricted in
> distribution, eg to one or two source texts).
>
> d) I remember 'panhandle' being a problem in c 1994: no individual
> form, or wordclass, or meaning, was frequent enough, but adding them all
> together raised it above the 'threshold'; but the problem then was that
> the evidence for each dictionary element was minimal, and caused problems
> in writing definitions, assessing grammar "patterns" (how many
> occurrences do you need to constitute a reliable pattern?), style
> labels, etc
>
> Best
> Ramesh
>
> At 08:23 25/05/2007, TadPiotr wrote:
>>Dear All,
>>Is there any reliable way of calculating the threshold for hapax legomena
>>in a corpus of a given size? By a threshold I mean the number of
>>occurrences (tokens) below which any types will be treated as
>>statistically insignificant. There is a common belief that below 10
>>occurrences what we have is hapax legomena, but that should differ with
>>respect to the size of the corpus, it seems to me, and that should be
>>calculable.
>>Any help will be appreciated.
>>Yours,
>>Tadeusz Piotrowski
>>Professor in Linguistics
>>Opole University
>>Poland
>
> Ramesh Krishnamurthy
> Lecturer in English Studies, School of Languages and Social Sciences,
> Aston University, Birmingham B4 7ET, UK
> Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
> Floor, North Wing of Main Building]
> http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
> Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/
>
>
>


Khurshid Ahmad

Professor of Computer Science
Department of Computer Science
Trinity College,
DUBLIN-2
IRELAND
Phone 00 353 1 896 8429

Web Page: http://people.tcd.ie/kahmad


