Corpora: Summary: 'Help on Frequencies'

Mon Oct 15 08:39:33 UTC 2001

Dear List Members,

Last week I sent a message asking for help (hints, comments,
literature, etc.) on frequencies of occurrence (see original message below).

Thanks a million to the people who answered:

Tony Berber Sardinha
Raphael Salkie
Adam Kilgarriff
William Mann
Daniel Walker
Linda Bawcom
Jerome Richalot

This is a summary of the comments, literature and websites suggested:

Comments:

William Mann:
Remember that in the very early reports on the Brown Corpus (the grandfather
of all), the word "jabberwocky" showed up with fairly high frequency.

Daniel Walker:
> What do frequencies exactly tell?
Well, frequencies can give an idea of how likely some event is. A nice
analogy is the linguistic notion of markedness: the more likely a
linguistic phenomenon, the less marked it is, and vice versa. More
generally, statistics provide a well-founded way to incorporate empirical
evidence into linguistic studies.
> And more interesting, what do they hide?
> How misleading/erroneous can they be?
> How far can we rely on them?
It's hard to make inferences about infrequent events. This is both a good
and a bad thing. For example, sentences which would fail a grammaticality
judgement may be infrequent, providing empirical support for native
intuition. On the other hand, most of language is infrequent (this is
similar to Chomsky's notion of Poverty of Stimulus), which means it can be
very difficult to collect examples of interesting phenomena. Most texts
have a bias towards some domain and can be misleading. For example, just
because the bilingual proceedings of the Canadian parliament translate
'House' as 'Chambre' 75% of the time doesn't necessarily indicate that
'House' rarely means 'maison'. The limitations of statistics in linguistics
vary according to what you're measuring and how you measure it. There are
well-established techniques for making cut-off and significance decisions,
but there is also a need for experimentation and maybe even art.
> What other features/aspects/measures should also be considered?
> Are there ways/techniques to "correct" frequencies indices, statistically?
> I would most appreciate ideas, comments and literature on this issue.
There are many interesting and useful statistics that one can take from
some body of text, and many techniques can be used to "correct" or smooth
counts. I would suggest reading "The linguist's guide to statistics" by
Krenn et al. http://citeseer.nj.nec.com/krenn97linguists.html
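
As a rough illustration of what "smoothing" counts can mean, here is a
minimal Python sketch of add-one (Laplace) smoothing, one of the simplest
such techniques. It is not taken from any of the works cited here; the
function name, the toy sentence and the unseen word 'maison' are assumptions
made purely for illustration, and the sketch assumes every token is in the
vocabulary.

```python
from collections import Counter


def add_one_probabilities(tokens, vocabulary):
    """Return add-one (Laplace) smoothed probabilities for each vocabulary word.

    Assumes every token in `tokens` also appears in `vocabulary`.
    """
    counts = Counter(tokens)
    # Each vocabulary word gets one extra "pseudo-count".
    total = len(tokens) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}


# Toy data (invented for illustration only).
tokens = "the house is the chamber of the people".split()
vocabulary = set(tokens) | {"maison"}  # 'maison' never occurs in the sample

probs = add_one_probabilities(tokens, vocabulary)
for word in sorted(probs, key=probs.get, reverse=True):
    print(f"{word:10s} {probs[word]:.3f}")
```

The unseen word ends up with a small non-zero probability instead of zero,
at the cost of slightly lowering the estimates for the words that were
actually observed.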

Linda Bawcom:
John Sinclair (1991), Corpus, Concordance, Collocation, says: 'Any instance
of language depends on its surrounding context. The details of choice shown
in any segment of a text depend - some of them - on choices made elsewhere in
the text, and so no example is ever complete unless it is a whole text' (p.
5).
Michael Hoey also said at the TESOL Spain conference (1997?): 'Wordlists
homogenize the heterogeneous'.
That is why, for me, the frequency of a word is only the first step. It is
interesting in itself, but it is not that important (unless one is making a
dictionary like COBUILD). For me (as a teacher), the most important thing is
context: how the word collocates, colligates or co-occurs. That is, if I am
a language learner, and a lazy one at that (which I am!), and my teacher
tells me that two words are synonyms, I am going to learn only one of them.
What I have seen is: 1) (as regards trusting a corpus) what you get out of a
corpus depends a great deal on the corpus; you have to be very careful about
its purpose. 2) You cannot classify whole sets of words, as textbooks for
learners do (e.g. ways of looking, ways of touching), without giving a
context.
An example: I am looking (for a presentation) at the difference between
'tal vez' and 'quizá'. What I have seen is that 'quizá' is followed by
'por eso' or 'para mas alguna razón' 8 times more often than 'tal vez' is,
and also that both have a negation in their contexts in almost half of the
instances. I don't know why. Now, as a native speaker, you no doubt already
knew this. But I was surprised.

Jerome Richalot:
"Statistics for corpus linguistics" by Michael P. Oakes (Edin. Textbooks in
empirical linguistics, EUP) seems like a good place to start. It goes beyond
raw frequencies and purely descriptive statistics into inferential
statistics.
Chapter 1 starts with a quote (de Haan and van Hout 1986) referring to
descriptive statistics as "the useful loss of information." I take this at
least to mean that one should be aware that some information is lost
through purely descriptive statistics. I just wonder how "useful" that loss is!
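
To make the point about lost information concrete, here is a small Python
sketch under invented assumptions: a toy "corpus" of four short texts in
which two words have the same total frequency but very different spread
across texts. The function name and the data are hypothetical and only
meant to show what a single total can hide.

```python
from collections import Counter


def frequency_and_range(texts, word):
    """Return (total count, number of texts in which the word occurs)."""
    per_text = [Counter(text.lower().split())[word] for text in texts]
    total = sum(per_text)
    text_range = sum(1 for count in per_text if count > 0)
    return total, text_range


# Toy corpus (invented): 'alpha' is concentrated in one text,
# 'beta' is spread evenly over all four.
texts = [
    "alpha beta alpha gamma alpha alpha",
    "beta gamma delta",
    "beta delta gamma",
    "gamma beta",
]

for word in ("alpha", "beta"):
    total, text_range = frequency_and_range(texts, word)
    print(f"{word}: total={total}, occurs in {text_range} of {len(texts)} texts")
```

Both words have the same total, so a plain frequency list would treat them
alike; only the per-text breakdown shows that one of them comes from a
single text.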

Literature/Websites:

Charniak, Eugene. 1993. Statistical Language Learning. Cambridge, MA: MIT
Press

Manning, Christopher, and Hinrich Schütze. 1999. Foundations of Statistical
Natural Language Processing. Cambridge, MA: MIT Press

Jurafsky, Dan, and James Martin. 2000. Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics
and Speech Recognition. Upper Saddle River, NJ: Prentice Hall

Sinclair, John. 1991. Corpus, Concordance, Collocation

Hoey, Michael. 1997. 'Wordlists homogenize the heterogeneous'. Conference,
TESOL Spain

Oakes, Michael P. 1997. Statistics for Corpus Linguistics. Edinburgh
Textbooks in Empirical Linguistics. Edinburgh: Edinburgh University Press

Kilgarriff, Adam. 1997. Putting Frequencies into the Dictionary.
International Journal of Lexicography 10(2): 135-155

Kilgarriff, Adam. Forthcoming. Comparing Corpora. International Journal of
Corpus Linguistics

Kilgarriff, Adam, and Raphael Salkie. 1996. Corpus similarity and
homogeneity via word frequency. In M. Gellerstam et al. (eds), EURALEX '96
Proceedings. Göteborg: Göteborg University, 121-30.
http://info.ox.ac.uk/bnc/using/papers/kilgarriff96a.html

Krenn et al. The linguist's guide to statistics.
http://citeseer.nj.nec.com/krenn97linguists.html

------------------------------------------------------------
Many corpus-based applications in foreign-language materials and dictionary
making, among others, rely mostly on raw frequencies (absolute and/or
relative frequencies) of word forms, lemmas, bi-grams, etc. Frequency
indices are taken into account in order to decide whether an item should be
considered or not.
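
For readers less familiar with the terms, the following Python sketch shows
one way absolute frequencies, relative frequencies (here normalised per
million tokens) and bi-gram counts might be computed. The function names,
the per-million convention and the toy sentence are assumptions for
illustration, not a description of any particular tool.

```python
from collections import Counter


def relative_frequencies(tokens, per=1_000_000):
    """Map each word form to its frequency per `per` tokens."""
    n = len(tokens)
    if n == 0:
        return {}
    counts = Counter(tokens)  # absolute frequencies
    return {word: count * per / n for word, count in counts.items()}


def bigram_counts(tokens):
    """Count adjacent pairs of tokens (bi-grams)."""
    return Counter(zip(tokens, tokens[1:]))


# Toy data (invented for illustration only).
tokens = "the cat sat on the mat".split()

absolute = Counter(tokens)
relative = relative_frequencies(tokens)
bigrams = bigram_counts(tokens)

print(absolute.most_common(1))
print(round(relative["the"]))
print(bigrams[("the", "cat")])
```

Relative frequencies make counts from corpora of different sizes
comparable, but they are still raw frequencies in the sense discussed here:
they say nothing about context or about how evenly an item is distributed.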

And here are my questions:
What do frequencies exactly tell? 
And more interesting, what do they hide?
How misleading/erroneous can they be?
How far can we rely on them? 
What other features/aspects/measures should also be considered?
Are there ways/techniques to "correct" frequencies indices, statistically?

I would most appreciate ideas, comments and literature on this issue. 
I also promise to send a summary of all the mails I receive.

Best regards and a million thanks

Pascual

-----------------------------------------------------
Dr. Pascual Cantos Gómez

Departamento de Filología Inglesa
Universidad de Murcia
C/. Santo Cristo, 1
30071 Murcia (Spain)

Tel.:	+34 968 364365
Fax:	+34 968 363185
E-mail:	pcantos at fcu.um.es
http://www.um.es/lacell/miembros/pcg/