[Corpora-List] Absolute Frequencies

Cedric Krummes cedric.krummes at uni-leipzig.de
Tue Feb 25 10:16:43 UTC 2014


Dear colleagues,

I cannot get my head around normalised token figures. Please help.

I have two corpora. Corpus "Foo" has 1,000 bigrams (tokens) and corpus 
"Bar" has 4,000 bigrams (tokens). Both corpora are under 500,000 tokens, 
so they are quite small. I normalised the bigram token counts per 1 
million tokens (20,000 vs. 40,000).

I have now been advised that these should be normalised to a smaller 
total number of tokens.

Does it matter whether normalisation is per 1 million tokens or per, say, 
10,000 tokens? If it's just to make the figures relative and, maybe, to do 
some descriptive stats, then surely any normalisation base is good.
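
To make the question concrete, here is a minimal Python sketch of the 
arithmetic. The corpus sizes (50,000 and 100,000 tokens) are assumptions, 
back-calculated from the per-million figures above rather than stated 
anywhere, and the function name per_n_tokens is purely illustrative. As 
far as I can see, changing the base only rescales both figures by the 
same factor:

```python
def per_n_tokens(count, corpus_size, base):
    """Return `count` occurrences in `corpus_size` tokens as a rate per `base` tokens."""
    return count * base / corpus_size


# Bigram token counts as given in the post.
FOO_COUNT = 1_000
BAR_COUNT = 4_000

# ASSUMPTION: corpus sizes inferred from the per-million figures
# (1,000 -> 20,000 per million implies 50,000 tokens, and so on).
FOO_SIZE = 50_000
BAR_SIZE = 100_000

for base in (1_000_000, 10_000):
    foo = per_n_tokens(FOO_COUNT, FOO_SIZE, base)
    bar = per_n_tokens(BAR_COUNT, BAR_SIZE, base)
    # The ratio Bar/Foo does not depend on the chosen base.
    print(f"per {base:,} tokens: Foo = {foo:,.0f}, Bar = {bar:,.0f}, "
          f"Bar/Foo = {bar / foo:.1f}")
```

Under these assumed sizes the ratio Bar/Foo comes out as 2.0 for both 
bases, so the comparison between the corpora is unchanged; only the 
scale of the reported numbers differs.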

Best wishes,

Cédric Krummes
-- 
Dr. Cédric Krummes
"SMS Communication in Switzerland"

Universität Leipzig · +49-341-97-37404
http://www.cedrickrummes.org/contact/
