Corpora: Help on Frequencies

Pascual Cantos pcantos at
Fri Oct 5 09:40:59 UTC 2001

Dear List Members,

Many corpus-based applications in foreign-language materials and dictionary
making, among others, rely mostly on raw frequencies (absolute and/or
relative) of word forms, lemmas, bi-grams, etc. Frequency indices are used
to decide whether an item should be considered or not.
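
To make the terms concrete, here is a minimal Python sketch of absolute and
relative frequencies for word forms and bi-grams, followed by a simple
frequency cut-off. The toy sentence, the function name frequency_table and
the threshold MIN_COUNT are assumptions made up for illustration; they are
not taken from any particular application.

```python
from collections import Counter


def frequency_table(items):
    """Map each item to (absolute frequency, relative frequency)."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: (n, n / total) for item, n in counts.items()}


# Toy corpus, invented purely for illustration.
tokens = "the cat sat on the mat and the dog sat too".split()

# Bi-grams as pairs of adjacent word forms.
bigrams = list(zip(tokens, tokens[1:]))

word_freqs = frequency_table(tokens)
bigram_freqs = frequency_table(bigrams)

# Hypothetical cut-off: an item is "considered" only if it occurs
# at least MIN_COUNT times in the corpus.
MIN_COUNT = 2
selected = sorted(w for w, (n, _) in word_freqs.items() if n >= MIN_COUNT)

for word in selected:
    absolute, relative = word_freqs[word]
    print(f"{word}\t{absolute}\t{relative:.3f}")
```

A decision of this kind depends only on how often an item occurs in the
whole corpus, which is the practice the questions below are about.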

And here are my questions:
What exactly do frequencies tell us?
And, more interestingly, what do they hide?
How misleading or erroneous can they be?
How far can we rely on them?
What other features, aspects or measures should also be considered?
Are there ways or techniques to "correct" frequency indices statistically?

I would greatly appreciate ideas, comments and literature on this issue.
I also promise to post a summary of all the replies I receive.

Best regards and many thanks


Dr. Pascual Cantos Gómez

Departamento de Filología Inglesa
Universidad de Murcia
C/. Santo Cristo, 1
30071 Murcia (Spain)

Tel.:	+34 968 364365
Fax:	+34 968 363185
E-mail:	pcantos at
