Corpora: Frequency Meaning

eric at scs.leeds.ac.uk
Thu Feb 17 10:14:45 UTC 2000


Pascual,
one point to remember is Zipf's law: the frequencies of countable things
in language are very unevenly distributed, with a few very frequent items
and a long tail of rare ones.  You may therefore need a logarithmic rather
than a linear scale when classifying into low/medium/high frequency.  For
example, many years ago I worked on the wordlist and suffixlist used in the
LOB Corpus tagging program, which graded the possible POS-tags of each word
and suffix on a logarithmic scale: tags were classified as
common/rare/very-rare, where "rare" meant about 10% or less of occurrences
and "very rare" meant 1% or less,
e.g.  water NN VB@  means "water" is usually a Noun and a Verb about 10% of
the time (the "@" marks VB as rare).
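
As a rough illustration of that kind of banding, here is a minimal Python
sketch.  It is not the LOB tagging program's code: the function name, the
exact cut-offs (1 in 10 and 1 in 100, boundaries included) and the counts
for "water" are assumptions made up for the example.

```python
def frequency_band(tag_count, word_count):
    """Grade one tag of one word as 'common', 'rare' or 'very rare'.

    The cut-offs step by a factor of ten (a logarithmic scale).
    Integer arithmetic avoids floating-point trouble at the boundaries.
    """
    if word_count <= 0 or not 0 <= tag_count <= word_count:
        raise ValueError("need 0 <= tag_count <= word_count and word_count > 0")
    if tag_count * 100 <= word_count:   # 1% of occurrences or less
        return "very rare"
    if tag_count * 10 <= word_count:    # 10% of occurrences or less
        return "rare"
    return "common"


# Invented counts: 100 occurrences of "water", 90 as noun, 10 as verb.
water = {"NN": 90, "VB": 10}
total = sum(water.values())
bands = {tag: frequency_band(count, total) for tag, count in water.items()}
```

With these invented counts NN comes out as "common" and VB as "rare",
which matches the  water NN VB@  entry.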

You need huge data samples to yield frequencies accurate enough to support
finer-grained distinctions.  I would advise against as many as 5 levels
(Very Low/Low/Moderate/High/Very High) unless you are sure you can get
enough examples to classify with confidence.
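
To give a feel for how fast the sample requirement grows, here is a
back-of-envelope Python sketch.  It assumes each occurrence of a word is an
independent draw (a simple binomial model, which real text only roughly
follows), and both the function name and the 25% relative-error target are
arbitrary choices for illustration.

```python
import math


def occurrences_needed(share, relative_error=0.25):
    """Occurrences of a word needed to estimate a tag's share of them.

    Under a binomial model the standard error of an estimated share p from
    n occurrences is sqrt(p * (1 - p) / n).  Requiring that error to be at
    most relative_error * p and solving for n gives
    n >= (1 - p) / (p * relative_error ** 2).
    """
    if not 0 < share < 1:
        raise ValueError("share must be strictly between 0 and 1")
    if relative_error <= 0:
        raise ValueError("relative_error must be positive")
    return math.ceil((1 - share) / (share * relative_error ** 2))


# Each extra factor-of-ten band needs roughly ten times as many examples.
for share in (0.1, 0.01, 0.001):
    print(f"share {share}: about {occurrences_needed(share)} occurrences")
```

Under these assumptions a 10% tag needs on the order of 150 occurrences of
the word, a 1% tag on the order of 1,600 and a 0.1% tag on the order of
16,000, so every extra level at the rare end multiplies the data needed
by about ten.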

Eric

Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Coordinator
 Centre for Computer Analysis of Language And Speech (CCALAS)
 School of Computer Studies, Faculty of Engineering,
 University of Leeds, LEEDS LS2 9JT, England
 EMAIL: eric at scs.leeds.ac.uk  TEL: (44)113-2335430  FAX: (44)113-2335468
 WWW: http://www.scs.leeds.ac.uk/eric
