<html>

<body>

Hullo Jenny! Merely to add that depending on the kind of phenomenon you

are examining and the frequency, it is possible to normalise to a per

ten-thousand figure too.<br><br>
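For what it's worth, here is a minimal Python sketch of the calculation Eric gives in the message quoted below, with the base as a parameter so that one function covers both per-thousand and per-ten-thousand figures. The name normalised_frequency and its per parameter are made up for illustration and do not come from any corpus tool.<br><br>

<pre>
```python
def normalised_frequency(count, corpus_size, per=1000):
    """Return the number of occurrences per `per` words of running text.

    Hypothetical helper: (count / corpus_size) * per, computed as
    count * per / corpus_size to keep round figures exact.
    """
    if corpus_size == 0:
        raise ValueError("corpus_size must not be zero")
    return count * per / corpus_size


# Figures taken from the thread: 200 hits in 4,000 words, 2,646 hits in 55,166 words.
even_case = normalised_frequency(200, 4000)                      # 50.0
per_thousand = normalised_frequency(2646, 55166)                 # about 47.96
per_ten_thousand = normalised_frequency(2646, 55166, per=10000)  # about 479.64

print(even_case, round(per_thousand, 2), round(per_ten_thousand, 2))
```
</pre>

Rounding is left to the caller, so how much precision to report stays a separate decision.<br><br>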

The thing to watch out for if you're working with corpora of different

sizes is that the total number of lemmas (lemmata/types) will increase

with a bigger corpus, so that statements about the top

x% of lemmas will not be comparable across corpora of different sizes (e.g.

'the word "confluence" is in the group of the 40% most frequently

occurring words').<br><br>
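A rough sketch of that effect in Python, assuming a made-up toy text and using distinct word forms as a stand-in for lemmas (no lemmatiser is involved). The helper type_counts is hypothetical. It counts the distinct forms in ever larger samples of the same text and shows how many of them a "top 40%" would cover at each size.<br><br>

<pre>
```python
def type_counts(tokens, sample_sizes):
    """Number of distinct word forms in the first n tokens, for each n."""
    return {n: len(set(tokens[:n])) for n in sample_sizes}


# Invented toy text, purely for illustration.
tokens = "the cat sat on the mat and the dog sat on the log".split()
counts = type_counts(tokens, [4, 8, 13])

for n, types in counts.items():
    # The "top 40%" spans a different number of types at each sample size.
    print(n, types, round(0.4 * types, 1))
```
</pre>

The bigger the sample, the more types it contains, so the same percentage band picks out a different number of words each time.<br><br>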

Cheers,<br>

Peter (who met you at Asialex)<br><br>

At 17.04 12/9/2005 +0800, Jenny Eagleton wrote:<br>

<blockquote type=cite class=cite cite>Thanks for the quick response from

everybody, I have got the idea now.<br><br>

Jenny<br>


<dl>

<dd><font face="verdana" size=2>----- Original Message -----<br>


<dd><b>Subject: </b>Re: [Corpora-List] "normalizing" frequencies

for different-sized corpora<br>


<dd><b>From: </b>eric@comp.leeds.ac.uk<br>


<dd><b>To: </b>jenny@asian-emphasis.com<br>


<dd><b>Date: </b>12-09-2005 16:59<br>

</font><br><br>


<dd>Jenny,<br><br>


<dd>I may be missing something, but I think the way to find a

per-thousand<br>


<dd>figure is simply:<br><br>

<br>


<dd>( (freq of word) / (no of words in text) ) * 1000<br><br>


<dd>eg (200/4000) * 1000 = 50<br><br>


<dd>or (2646/55166) * 1000 = 48 (to nearest whole number)<br><br>


<dd>- of course it's up to you whether to round to nearest whole

number,<br>


<dd>or give the answer to 2 decimal places (47.96) or some other

level<br>


<dd>of accuracy; but since generally a text is only a sample or<br>


<dd>approximation of the language you are studying, it is sensible not

to<br>


<dd>claim too much accuracy/significance.<br><br>


<dd>eric atwell<br><br>

<br>


<dd>On Mon, 12 Sep 2005, Jenny Eagleton wrote:<br><br>


<dd>> Hello Corpora and Statistics Experts,<br>


<dd>><br>


<dd>> This is a very simple question for all the<br>


<dd>> corpora/statistics experts<br>


<dd>> out there, but this novice is not really<br>


<dd>> mathematically inclined. I<br>


<dd>> understand Biber's principle of "normalization",<br>


<dd>> however I am not sure<br>


<dd>> about how to calculate it. I want frequency counts<br>


<dd>> normalized per<br>


<dd>> 1,000 words of text. I can see how to do it if the<br>


<dd>> figures are even,<br>


<dd>> i.e. if I have a corpus of 4,000 words and a<br>


<dd>> frequency of 200,&#160;<br>


<dd>> I would have a normalized figure of 50.<br>


<dd>><br>


<dd>> But for mixed numbers, how would I calculate the<br>


<dd>> following: For<br>


<dd>> example if I have 2,646 instances of a certain<br>


<dd>> kind of noun in a<br>


<dd>> corpus of 55,166 how would I calculate the<br>


<dd>> normalized figure per<br>


<dd>> 1,000 words?<br>


<dd>><br>


<dd>> Regards,<br>


<dd>><br>


<dd>> Jenny<br>


<dd>> Research Assistant<br>


<dd>> Dept. of English &amp; Communication<br>


<dd>> City University of Hong Kong<br>


<dd>><br>


<dd>><br>


<dd>><br><br>


<dd>-- <br>


<dd>Eric Atwell, Senior Lecturer, Language research group, School of

Computing, <br>


<dd>Faculty of Engineering, University of Leeds, LEEDS LS2 9JT,

England<br>


<dd>TEL: +44-113-2335430 FAX: +44-113-2335468

<a href="http://www.comp.leeds.ac.uk/eric">http://www.comp.leeds.ac.uk/eric</a><br>


</dl></blockquote></body>

</html>