<html>
<body>
Hullo Jenny! Just to add that, depending on the kind of phenomenon you
are examining and its frequency, it is possible to normalise to a
per-ten-thousand figure too.<br><br>
The thing to watch out for if you're working with corpora of different
sizes is that the total number of lemmas (lemmata/types) will increase
with a bigger corpus, so statements about the top x% of lemmas will not
be comparable across corpora of different sizes (eg 'the word
"confluence" is among the 40% most frequently occurring
words').<br><br>
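In code, the per-1,000 (or per-10,000) figure is just the proportion
scaled up; a quick Python sketch, where the function name is mine and
the figures are taken from Jenny's question below:<br>
<pre>
def normalise(freq, corpus_size, per=1000):
    """Return a raw frequency count normalised to `per` words of text."""
    return freq / corpus_size * per

# 2,646 instances of the noun in a 55,166-word corpus:
print(round(normalise(2646, 55166, per=1000), 2))    # 47.96 per 1,000 words
print(round(normalise(2646, 55166, per=10000), 2))   # 479.64 per 10,000 words
</pre>
<br>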
Cheers,<br>
Peter (who met you at Asialex)<br><br>
At 17.04 12/9/2005 +0800, Jenny Eagleton wrote:<br>
<blockquote type=cite class=cite cite>Thanks for the quick response from
everybody, I have got the idea now.<br><br>
Jenny<br>
<dl>
<dd><font face="verdana" size=2>----- Original Message -----<br>
<dd><b>Subject: </b>Re: [Corpora-List] "normalizing" frequencies
for different-sized corpora<br>
<dd><b>From: </b>eric@comp.leeds.ac.uk<br>
<dd><b>To: </b>jenny@asian-emphasis.com<br>
<dd><b>Date: </b>12-09-2005 16:59<br>
</font><br><br>
<dd>Jenny,<br><br>
<dd>I may be missing something, but I think the way to find a
per-thousand<br>
<dd>figure is simply:<br><br>
<br>
<dd>( (freq of word) / (no of words in text) ) * 1000<br><br>
<dd>eg (200/4000) * 1000 = 50<br><br>
<dd>or (2646/55166) * 1000 = 48 (to nearest whole number)<br><br>
<dd>- of course it's up to you whether to round to the nearest whole
number,<br>
<dd>or give the answer to 2 decimal places (47.96) or some other
level<br>
<dd>of accuracy; but since generally a text is only a sample or<br>
<dd>approximation of the language you are studying, it is sensible not
to<br>
<dd>claim too much accuracy/significance.<br><br>
<dd>eric atwell<br><br>
<br>
<dd>On Mon, 12 Sep 2005, Jenny Eagleton wrote:<br><br>
<dd>> Hello Corpora and Statistics Experts,<br>
<dd>><br>
<dd>> This is a very simple question for all the<br>
<dd>> corpora/statistics experts<br>
<dd>> out there, but this novice is not really<br>
<dd>> mathematically inclined. I<br>
<dd>> understand Biber's principle of "normalization,"<br>
<dd>> however I am not sure<br>
<dd>> about how to calculate it. I want frequency counts<br>
<dd>> normalized per<br>
<dd>> 1,000 words of text. I can see how to do it if the<br>
<dd>> figures are even,<br>
<dd>> i.e. if I have a corpus of 4,000 words and a<br>
<dd>> frequency of 200, <br>
<dd>> I would have a normalized figure of 50.<br>
<dd>><br>
<dd>> But for mixed numbers, how would I calculate the<br>
<dd>> following: For<br>
<dd>> example if I have 2,646 instances of a certain<br>
<dd>> kind of noun in a<br>
<dd>> corpus of 55,166 how would I calculate the<br>
<dd>> normalized figure per<br>
<dd>> 1,000 words?<br>
<dd>><br>
<dd>> Regards,<br>
<dd>><br>
<dd>> Jenny<br>
<dd>> Research Assistant<br>
<dd>> Dept. of English & Communication<br>
<dd>> City University of Hong Kong<br>
<dd>><br>
<dd>><br>
<dd>><br><br>
<dd>-- <br>
<dd>Eric Atwell, Senior Lecturer, Language research group, School of
Computing, <br>
<dd>Faculty of Engineering, University of Leeds, LEEDS LS2 9JT,
England<br>
<dd>TEL: +44-113-2335430 FAX: +44-113-2335468
<a href="http://www.comp.leeds.ac.uk/eric">http://www.comp.leeds.ac.uk/eric</a><br>
</dl></blockquote></body>
</html>