[Corpora-List] [Corpora List] Absolute Frequencies

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Thu Feb 27 12:56:20 UTC 2014


Hi Bob (if I may)

Is your post in response to Cedric Krummes's post, which I responded to previously?

If so, I cannot agree with your argument (though perhaps I did not fully understand it?)...

To take your example case... a corpus of encyclopedia articles about Australia,
and a corpus of articles from the *same* encyclopedia about *England/UK*
[to keep variables to a minimum] would indeed show differences in their word
frequency lists.

However, assuming that all the articles in that encyclopedia were written in
approximately the same 'house style', and followed the same guidelines for selection of topic
and manner of treatment of various topic types, but that there was a significant difference
in the corpus sizes (e.g., say, A = Australia = 10 texts = 100,000 words, E = England/UK
= 20 texts = 200,000 words [or vice versa, depending perhaps on which country
it was produced in, or for which market]), a comparison of raw/absolute word
frequencies might suggest that, say, 'the' was twice as frequent in corpus E as in
corpus A.

By 'normalising' both raw/absolute frequency lists to frequency per 1000
words, would we not obtain a better comparison? The fact that 'Sydney' or 'kangaroo'
was more frequent in corpus A than in corpus E would be useful linguistic confirmation of
our real-world knowledge of the primary topic of each corpus, and would offer interesting
insights into which features of Australia or England/UK were deemed to be important by one
encyclopedia publishing house.

The decision about whether to normalise to 'frequency per 1 million words' or 'frequency
per 10,000 words' or 'frequency per 1000 words' is not a decision about the 'normality'
or 'abnormality' of the corpus contents? That could only be addressed by compiling corpora
of other topics, other encyclopedias, other data-types/genres, etc?

best wishes
Ramesh


----
Date: Wed, 26 Feb 2014 10:48:16 -0600
From: "Robert A. Amsler" <amsler at cs.utexas.edu>
Subject: Re: [Corpora-List] [Corpora List] Absolute Frequencies
To: Corpora at uib.no

What seems to be the problem is that normalization is dependent upon
assuming the separate sub-corpora are themselves 'normal', that is, that
they don't individually differ from each other or from 'average text' of
the language in significant ways. So perhaps, to 'correctly' do
normalization, one needs to know how 'normal' each sub-corpus in the
collection is and compensate for its degree of abnormality?

It could be that if one examined some particular set of vocabulary in the
separate corpora, such as the ratio of the frequency of the commonest
function words to the overall size of the corpus, or the frequencies of the
commonest content words, one could get individualized normalization
factors.
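
As a purely illustrative sketch (the function-word list and the 'reference'
proportion below are assumptions made up for the example, not established
figures), such an individualized factor might be computed like this:

from collections import Counter

# Illustrative only: a hand-picked function-word list and an assumed
# share of function words in 'average' running text.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "for", "on"}
REFERENCE_PROPORTION = 0.40  # assumed reference share, not an established figure

def normalization_factor(tokens):
    # Ratio of this sub-corpus's function-word share to the reference share.
    counts = Counter(t.lower() for t in tokens)
    function_word_share = sum(counts[w] for w in FUNCTION_WORDS) / sum(counts.values())
    return function_word_share / REFERENCE_PROPORTION

# A toy sub-corpus heavy on topic-specific content words gets a factor well below 1.
sample = "Sydney kangaroo outback and coral reef near Queensland Australia".split()
print(round(normalization_factor(sample), 2))  # about 0.28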

At its worst, one might have to normalize the frequencies of the words in
each sub-corpus that were 'abnormal'.

For example, imagine a corpus consisting of sub-corpora, each of which is
a collection of encyclopedia articles about the same subject (e.g., all
the encyclopedia articles about "Australia"; all the encyclopedia articles
about "London", etc. as sub-corpora). Each sub-corpus would have its key
concepts, with significantly higher frequencies for certain words
directly tied to the subject of the articles. To
'normalize' these texts (i.e., to make them each act like 'average' text),
those content words would have to have their frequencies 'normalized' to
more average frequencies before they were combined together?
----
On 25/02/2014 13:56, Krishnamurthy, Ramesh wrote:
> Hi Cedric
>
> As far as I know, and as you say, normalisation is just a convenient
> way to compare two corpora that are unequal in size.
>
> Therefore there is no right and wrong ratio. But the choice of
> normalisation ratio is supposed to increase the ease of comparison.
>
> Consider the ACORN corpus subcorpora sizes:
> http://acorn.aston.ac.uk/acorn_publication.html
> 2. More information about the texts in the ACORN corpora:
> a) English corpora <http://acorn.aston.ac.uk/English%20corpora%20descriptions%20-RK280510.doc>
> b) French corpora <http://acorn.aston.ac.uk/French%20corpora%20descriptions%20-RK280510.doc>
> c) German corpora <http://acorn.aston.ac.uk/German%20corpora%20descriptions%20-RK280510.doc>
> d) Spanish corpora <http://acorn.aston.ac.uk/Spanish%20corpora%20descriptions%20-RK280510.doc>
>
> The Spanish subcorpora are fewest, so let us look at their sizes:
> 1. Academic texts [0.3m words]
> 2. French Presidential website pages [0.2m words]
> 3. European Commission [0.3m words]
> 4. European Parliament [30m words]
> 5. Literary Classics [0.4m words]
> 6. Nobel Prize acceptance speeches [0.02m words]
> 7. Deutschland Online [0.04m words]
>
> If we normalised to 'per million', the most frequent word in most of
> the subcorpora, 'de', would have normalised frequencies of:
> 1. 20899/300000 x 1000000 = 69663/million
> 2. 14783/200000 x 1000000 = 73915/million
> 3. 28247/300000 x 1000000 = 94156/million
> 4. 1871627/30000000 x 1000000 = 62387/million
> 5. 21267/400000 x 1000000 = 53168/million
> 6. 1298/20000 x 1000000 = 64900/million
> 7. 3116/40000 x 1000000 = 77900/million
>
> We decided that such large numbers were not easy to work with, and we
> therefore normalised to 10,000:
> 1. 715 / 10,000
> 2. 858
> 3. 905
> 4. 725
> 5. 566
> 6. 622
> 7. 825
>
> We gave tables of normalised frequencies for the 500 most frequent
> words in each subcorpus, so by using 'per 10,000', even the 500th
> item in each subcorpus would be above 0:
> 1. 2.3 / 10,000
> 2. 2.6
> 3. 2.4
> 4. 2.3
> 5. 2.2
> 6. 1.9
> 7. 2.4
>
> Users of the corpus would know that any item less frequent than the
> 500th item would occur less than 1.0 times in a corpus of 10,000
> words.
>
> hope this helps
>
> Ramesh
>
>
>
>
> ---------
> Message: 7
> Date: Tue, 25 Feb 2014 11:16:43 +0100
> From: Cedric Krummes <cedric.krummes at uni-leipzig.de>
> Subject: [Corpora-List] Absolute Frequencies
> To: Corpora at uib.no
>
> Dear colleagues,
>
> I cannot get my head around normalised token figures. Please help.
>
> I have two corpora. Corpus "Foo" has 1,000 bigrams (tokens) and
> corpus "Bar" has 4,000 bigrams (tokens). Both corpora are under
> 500,000 tokens, so quite small corpora. I normalised the bigram token
> figures per 1 million tokens. (20,000 vs. 40,000)
>
> I have now been advised that these should be normalised to a smaller
> total number of tokens.
>
> Does it matter whether normalisation is at 1 million tokens or at,
> say, 10,000 tokens? If it's just to make something relative and,
> maybe, to do some descriptive stats, then surely any normalisation is
> good.
>
> Best wishes,
>
> Cédric Krummes
>
> --
> Dr. Cédric Krummes
> "SMS Communication in Switzerland"
>
> Universität Leipzig · +49-341-97-37404
> http://www.cedrickrummes.org/contact/
>
>





