[Corpora-List] [Corpora List] Absolute Frequencies
Martin Weisser
weissermar at gmail.com
Wed Feb 26 04:37:10 UTC 2014
Hi Cédric,
I guess the reason you've been advised to normalise to something
different from 1 million is that you may accidentally exaggerate
the potential frequency of occurrence if your corpora are smaller than a
million words/tokens. This is because, at least to my knowledge, the
standard procedure for normalising is simply to calculate a relative
frequency and then multiply this by your normalisation factor,
thereby interpolating potential values if fewer words/tokens are
present. Generally, standard textbooks, such as Biber et al. (1998:
263), don't seem to see any problem in normalising by interpolating in
this way, but we should always remember that counting operations may
already produce errors, e.g. due to incorrect tokenisation, so
that by interpolating in this way, you may be introducing further
errors. Thus, it would generally be more sensible to do as Lindquist
(2009: 42) suggests, which is to "normalise[] downwards to a
figure close to the size of the smallest corpus", i.e. to what I'd call
the 'highest sensible common denominator', sensible because you probably
wouldn't want to normalise down to any 'irregular' number.
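
To make the arithmetic concrete, here is a minimal sketch in Python of the procedure described above (relative frequency times normalisation factor). The function name, the corpus names, and all counts and sizes are hypothetical, invented purely for illustration, and the base of 50,000 is just one possible 'regular' figure close to the smaller corpus.

```python
def normalise(raw_count, corpus_size, base):
    """Return the frequency per `base` tokens: (raw_count / corpus_size) * base."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * base


# Hypothetical counts and corpus sizes (not taken from any real corpus).
corpora = {
    "corpus_a": {"hits": 12, "tokens": 53_000},
    "corpus_b": {"hits": 45, "tokens": 120_000},
}

# A round figure close to the size of the smallest corpus (an assumption
# chosen for this example), compared with the conventional 1 million.
small_base = 50_000

for name, c in corpora.items():
    per_million = normalise(c["hits"], c["tokens"], 1_000_000)
    per_small = normalise(c["hits"], c["tokens"], small_base)
    print(f"{name}: {per_million:.1f} per million, "
          f"{per_small:.2f} per {small_base:,}")
```

Both bases give the same ratio between the two corpora; the difference is that the per-million figure for a 53,000-token corpus projects 12 observed hits onto a corpus almost 19 times larger than the one actually counted.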
I'm not sure whether anyone has ever tested the potential error rate of
such interpolations, which should be possible by comparing frequencies
projected from smaller sub-corpora to those of the full corpus, but, as
I've tried to point out above, it's probably best to avoid any potential
source of error in the first place in order to provide more or less
truly representative frequency estimates.
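
One rough way such a test could be set up is sketched below in Python: split a corpus into equal-sized sub-corpora, project each sub-corpus count up to the full corpus size, and compare the projection with the count actually observed. The function name and the toy token list are assumptions made up for this illustration, not an established procedure or real data.

```python
def extrapolation_errors(tokens, target, chunk_size):
    """Project the count of `target` in each full chunk up to the size of the
    whole corpus and return (actual count, list of projection - actual)."""
    total = len(tokens)
    actual = tokens.count(target)
    errors = []
    # Only full chunks are used; a shorter trailing remainder is ignored.
    for start in range(0, total - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        projected = chunk.count(target) * total / chunk_size
        errors.append(projected - actual)
    return actual, errors


# Toy token list, invented for illustration; whitespace splitting stands in
# for real tokenisation.
tokens = ("the cat sat on the mat " * 3 + "a dog ran to a park " * 3).split()

actual, errors = extrapolation_errors(tokens, "the", chunk_size=12)
print("actual count:", actual)
print("projection errors per sub-corpus:", errors)
```

In this deliberately skewed toy example the word is concentrated in the first half of the text, so the projections from the individual sub-corpora range from double the real count down to zero.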
References:
Biber, D., Conrad, S. & Reppen, R. (1998). 'Corpus Linguistics'. CUP.
Lindquist, H. (2009). 'Corpus Linguistics and the Description of
English'. EUP.
--
Cheers,
Martin
========================
Dr. phil. habil. Martin Weisser
Visiting Professor
School of English and Education
Guangdong University of Foreign Studies
510006 Guangzhou
P.R. China
Web: martinweisser.org