[Corpora-List] [Corpora List] Absolute Frequencies

Martin Weisser weissermar at gmail.com
Wed Feb 26 04:37:10 UTC 2014


Hi Cédric,
I guess the reason you've been advised to normalise to something 
other than 1 million is that you may accidentally exaggerate 
the potential frequency of occurrence if your corpora are smaller than a 
million words/tokens. This is because, at least to my knowledge, the 
standard procedure for normalising is simply to calculate a relative 
frequency and then multiply this by your normalisation factor, 
thereby interpolating potential values if fewer words/tokens are 
present. Generally, standard textbooks, such as Biber et al. (1998: 
263), don't seem to see any problem in normalising and interpolating in 
this way, but we should always remember that counting operations may 
already produce errors, e.g. due to incorrect tokenisation, so 
that by interpolating in this way, you may be introducing further 
errors. Thus, it would generally be more sensible to do as Lindquist 
(2009: 42) suggests, which is to "normalise[] downwards to a 
figure close to the size of the smallest corpus", i.e. to what I'd call 
the 'highest sensible common denominator', sensible because you probably 
wouldn't want to normalise down to any 'irregular' number.
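
To make the arithmetic concrete, here's a minimal Python sketch of the 
procedure described above. The function name, the default base and all 
the counts are invented for illustration; none of them come from Biber 
et al. or Lindquist.

```python
def normalised_frequency(raw_count, corpus_size, base=1_000_000):
    """Scale a raw count to a frequency per `base` tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiply before dividing to keep round figures exact.
    return raw_count * base / corpus_size


# Hypothetical counts: 30 hits in 50,000 tokens, 480 hits in 800,000 tokens.
small_per_million = normalised_frequency(30, 50_000)
large_per_million = normalised_frequency(480, 800_000)

# Normalising downwards to the size of the smaller corpus instead.
small_per_50k = normalised_frequency(30, 50_000, base=50_000)
large_per_50k = normalised_frequency(480, 800_000, base=50_000)

# Effect of a single miscounted token in the smaller corpus.
shift_per_million = normalised_frequency(1, 50_000)
shift_per_50k = normalised_frequency(1, 50_000, base=50_000)

print(small_per_million, large_per_million)  # 600.0 600.0
print(small_per_50k, large_per_50k)          # 30.0 30.0
print(shift_per_million, shift_per_50k)      # 20.0 1.0
```

With these made-up figures, both corpora come out at 600 per million or 
30 per 50,000 tokens, but one miscounted token in the smaller corpus 
moves the per-million figure by 20 and the per-50,000 figure by only 1.
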
I'm not sure whether anyone has ever tested the potential error rate of 
such interpolations, which should be possible by comparing the 
frequencies of smaller sub-corpora to those of a full corpus, but, as 
I've tried to point out before, it's probably best to avoid any 
potential source of error in the first place, so as to be able to 
provide more or less truly representative frequency estimates.
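
One way such a test might be set up, purely as a sketch: cut the full 
corpus into equal-sized sub-corpora, normalise the count from each to 
the same base, and see how far each estimate lies from the full-corpus 
figure. The function names, the toy sentence and the base of 100 are my 
own assumptions; a real test would need proper tokenisation and a much 
larger corpus.

```python
def normalised_frequency(raw_count, corpus_size, base):
    """Scale a raw count to a frequency per `base` tokens."""
    return raw_count * base / corpus_size


def subcorpus_deviations(tokens, target, chunk_size, base=1_000_000):
    """Compare normalised sub-corpus frequencies of `target` with the
    normalised frequency in the full token list.

    Returns the full-corpus figure and, for each complete chunk, the
    difference between the chunk's estimate and that figure.
    """
    full = normalised_frequency(tokens.count(target), len(tokens), base)
    deviations = []
    # Only complete chunks are used; a shorter trailing chunk is ignored.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        estimate = normalised_frequency(chunk.count(target), chunk_size, base)
        deviations.append(estimate - full)
    return full, deviations


# Toy 'corpus' of 12 tokens, split naively on whitespace.
toy_tokens = "the cat sat on the mat and the dog ran off fast".split()
full, deviations = subcorpus_deviations(toy_tokens, "the", chunk_size=4, base=100)

print(full)        # 25.0
print(deviations)  # [0.0, 25.0, -25.0]
```

Even in this tiny example the three sub-corpora give estimates of 25, 50 
and 0 per 100 tokens against a full-corpus figure of 25, which is the 
kind of spread one would want to measure on real data.
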

References:
Biber, D., Conrad, S. & Reppen, R. (1998). 'Corpus Linguistics'. CUP.
Lindquist, H. (2009). 'Corpus Linguistics and the Description of 
English'. EUP.
-- 
Cheers,
	Martin
========================
Dr. phil. habil. Martin Weisser
Visiting Professor
School of English and Education
Guangdong University of Foreign Studies
510006 Guangzhou
P.R. China
Web: martinweisser.org
