[Corpora-List] Absolute Frequencies

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Tue Feb 25 12:56:45 UTC 2014


Hi Cedric

As far as I know, and as you say, normalisation is just a
convenient way to compare frequencies across two corpora of unequal size.

Therefore there is no right or wrong ratio; the choice of
normalisation ratio is simply meant to make comparison easier.
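
In code terms the whole operation is one line; here is a minimal
Python sketch (the function name is mine, purely for illustration):

    def normalise(raw_count, corpus_size, base=1_000_000):
        # Frequency per 'base' tokens: scale the raw count by base/corpus_size.
        return raw_count / corpus_size * base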

Consider the ACORN corpus subcorpora sizes:
http://acorn.aston.ac.uk/acorn_publication.html
More information about the texts in the ACORN corpora:
a) English corpora <http://acorn.aston.ac.uk/English%20corpora%20descriptions%20-RK280510.doc>
b) French corpora <http://acorn.aston.ac.uk/French%20corpora%20descriptions%20-RK280510.doc>
c) German corpora <http://acorn.aston.ac.uk/German%20corpora%20descriptions%20-RK280510.doc>
d) Spanish corpora <http://acorn.aston.ac.uk/Spanish%20corpora%20descriptions%20-RK280510.doc>

The Spanish set has the fewest subcorpora, so let us look at their sizes:
1. Academic texts [0.3m words]
2. French Presidential website pages [0.2m words]
3. European Commission [0.3m words]
4. European Parliament [30m words]
5. Literary Classics [0.4m words]
6. Nobel Prize acceptance speeches [0.02m words]
7. Deutschland Online [0.04m words]

If we normalised to 'per million', the most frequent word in most of the
subcorpora, 'de', would have normalised frequencies of
1. 20899/300000 x 1000000 = 69663/million
2. 14783/200000 x 1000000 = 73915/million
3. 28247/300000 x 1000000 = 94156/million
4. 1871627/30000000 x 1000000 = 62387/million
5. 21267/400000 x 1000000 = 53168/million
6. 1298/20000 x 1000000 = 64900/million
7. 3116/40000 x 1000000 = 77900/million
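
(Those figures can be reproduced in a couple of lines of Python, using the
counts and sizes listed above; one or two come out different by 1 depending
on whether you round or truncate.)

    # Raw counts of 'de' and subcorpus sizes, as listed above.
    counts_and_sizes = [
        (20899, 300_000), (14783, 200_000), (28247, 300_000),
        (1_871_627, 30_000_000), (21267, 400_000),
        (1298, 20_000), (3116, 40_000),
    ]
    for i, (count, size) in enumerate(counts_and_sizes, start=1):
        print(f"{i}. {count / size * 1_000_000:.0f}/million")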

We decided that such large numbers were not easy to work with,
and we therefore normalised to 10,000:
1. 715 / 10,000
2. 858
3. 905
4. 725
5. 566
6. 622
7. 825
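
The base itself carries no information; it only determines how readable
the numbers are. A quick illustration with a made-up count:

    count, size = 150, 250_000       # hypothetical word count and corpus size
    print(count / size * 1_000_000)  # 600.0 per million
    print(count / size * 10_000)     # 6.0 per 10,000 -- same information, handier number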

We gave tables of normalised frequencies for the 500 most frequent words
in each subcorpus, so by using 'per 10,000', even the 500th item in each
subcorpus would still be above zero:
1. 2.3 / 10,000
2. 2.6
3. 2.4
4. 2.3
5. 2.2
6. 1.9
7. 2.4

Users of the corpus would also know that any item not in a table occurs
less often than that subcorpus's 500th item, i.e. at most a couple of
times per 10,000 words.
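
One way to make that design choice explicit (my own back-of-envelope
sketch, not anything from the ACORN documentation): for an item with raw
count c in a corpus of s words, the normalised frequency c/s x base stays
at or above 1.0 whenever the base is at least s/c, so you can pick the
smallest convenient base that keeps the rarest tabulated item visible.

    def min_base(count, corpus_size):
        # Smallest base at which this item's normalised frequency is >= 1.0.
        return corpus_size / count

    # The 2.6/10,000 listed for subcorpus 2 implies a raw count of 52 in
    # 200,000 words; min_base(52, 200_000) is about 3846, so a base of
    # 10,000 keeps even the 500th item comfortably above 1.0.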

hope this helps
ramesh




---------
Date: Tue, 25 Feb 2014 11:16:43 +0100
From: Cedric Krummes <cedric.krummes at uni-leipzig.de>
Subject: [Corpora-List] Absolute Frequencies
To: Corpora at uib.no

Dear colleagues,

I cannot get my head around normalised token figures. Please help.

I have two corpora. Corpus "Foo" has 1,000 bigrams (tokens) and corpus
"Bar" has 4,000 bigrams (tokens). Both corpora are under 500,000 tokens,
so they are quite small. I normalised the bigram token figures per 1
million tokens (20,000 vs. 40,000).

I have now been advised that these should be normalised to a smaller
total number of tokens.

Does it matter whether normalisation is per 1 million tokens or, say,
per 10,000 tokens? If it's just to make something relative and, maybe, to do
some descriptive stats, then surely any normalisation is good.
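
(For concreteness: the per-million figures imply corpus sizes of 50,000
and 100,000 tokens, in which case any base preserves the Foo-to-Bar
ratio:)

    # Sizes back-solved from the stated per-million figures:
    # 1,000/50,000 * 1e6 = 20,000 and 4,000/100,000 * 1e6 = 40,000.
    foo_rel = 1_000 / 50_000     # bigram rate in Foo
    bar_rel = 4_000 / 100_000    # bigram rate in Bar
    for base in (1_000_000, 10_000):
        print(base, foo_rel * base, bar_rel * base, bar_rel / foo_rel)
    # The Bar:Foo ratio is 2.0 at either base; only the magnitudes change.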

Best wishes,

Cédric Krummes
--
Dr. Cédric Krummes
"SMS Communication in Switzerland"

Universität Leipzig · +49-341-97-37404
http://www.cedrickrummes.org/contact/


