Corpora: Frequency Meaning
eric at scs.leeds.ac.uk
Thu Feb 17 10:14:45 UTC 2000
Pascual,
one point to remember is Zipf's law on the frequency distribution
of countable things in language. You may need to use a logarithmic scale
when classifying items as low/medium/high frequency. For example, many years
ago I worked on the wordlist and suffixlist used in the LOB Corpus tagging
program, which list the possible word-tags for each word and suffix.
The POS-tags were classified on a logarithmic scale as common/rare/very-rare,
where "rare" meant less than 10% of occurrences and "very rare" meant 1% or less,
e.g. water NN VB@ means "water" is usually a Noun and about 10% of the time
a Verb (the @ marks the rare tag).
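
As a rough illustration of this kind of three-way banding, here is a minimal Python sketch. It is not the LOB tagging program's code: the function names, the label strings, and the handling of the exact 10% and 1% boundaries are assumptions made for the example, and the counts in the usage note are invented.

```python
def classify_tag(count, total):
    """Band one tag of a word as 'common', 'rare' or 'very rare'.

    count: times the word was seen with this tag
    total: times the word was seen with any tag
    Thresholds follow the message above: "rare" is less than 10%,
    "very rare" is 1% or less. Boundary handling is an assumption.
    """
    if total <= 0 or count < 0 or count > total:
        raise ValueError("need 0 <= count <= total and total > 0")
    # Integer comparisons avoid floating-point trouble at the boundaries.
    if count * 100 <= total:      # share <= 1%
        return "very rare"
    if count * 10 < total:        # share < 10%
        return "rare"
    return "common"


def classify_word(tag_counts):
    """Band every tag of one word, given a dict of tag -> count."""
    total = sum(tag_counts.values())
    return {tag: classify_tag(n, total) for tag, n in tag_counts.items()}
```

For instance, with invented counts, classify_word({"NN": 95, "VB": 5}) labels NN as common and VB as rare. Each band covers a tenfold range of relative frequency, which is what makes the scale logarithmic.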
You need huge data samples to yield frequencies accurate enough to support
finer-grained distinctions. I would advise against as many as 5 levels
(Very Low/Low/Moderate/High/Very High) unless you are sure you can get
enough examples to classify with confidence.
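
One way to see why sample size matters is to put a confidence interval around an observed proportion and check whether it fits inside a single band. The Python sketch below uses the Wilson score interval, which is my choice for illustration and not something the approach above prescribes. The function names, the default z = 1.96 (roughly 95% confidence), and the band limits passed in are all assumptions.

```python
import math


def wilson_interval(count, total, z=1.96):
    """Wilson score interval for the proportion count/total.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    Returns (lower, upper).
    """
    if total <= 0 or count < 0 or count > total:
        raise ValueError("need 0 <= count <= total and total > 0")
    p = count / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def fits_in_band(count, total, low, high, z=1.96):
    """True if the whole interval lies strictly between low and high."""
    lower, upper = wilson_interval(count, total, z)
    return low < lower and upper < high
```

With 5 occurrences out of 100, the interval around 5% is wide enough to cross the 10% boundary, so fits_in_band(5, 100, 0.01, 0.10) is False. With 500 out of 10000, the same 5% sits comfortably inside the band. The more levels you use, the narrower each band is and the more examples you need before an interval fits inside one.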
Eric
Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Coordinator
Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, Faculty of Engineering,
University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric at scs.leeds.ac.uk TEL: (44)113-2335430 FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/eric