8.363, Sum: Corpora

linguist at linguistlist.org
Sun Mar 16 17:32:18 UTC 1997


LINGUIST List:  Vol-8-363. Sun Mar 16 1997. ISSN: 1068-4875.

Subject: 8.363, Sum: Corpora

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            T. Daniel Seely: Eastern Michigan U. <seely at linguistlist.org>

Review Editor:     Andrew Carnie <carnie at linguistlist.org>

Associate Editors: Ljuba Veselinova <ljuba at linguistlist.org>
                   Ann Dizdar <ann at linguistlist.org>
Assistant Editor:  Sue Robinson <sue at linguistlist.org>
Technical Editor:  Ron Reck <ron at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/

Editor for this issue: Susan Robinson <sue at linguistlist.org>

=================================Directory=================================

1)
Date:  Wed, 12 Mar 1997 11:30:15 -0500 (EST)
From:  Royle Phaedra <roylep at MAGELLAN.UMontreal.CA>
Subject:  Corpora

-------------------------------- Message 1 -------------------------------

Date:  Wed, 12 Mar 1997 11:30:15 -0500 (EST)
From:  Royle Phaedra <roylep at MAGELLAN.UMontreal.CA>
Subject:  Corpora


	I recently made a query on linguist list about corpora for
word lists with frequency counts in Bulgarian, Polish, Greek, Turkish
and English (excluding Kucera and Francis). Many people responded with
helpful comments, which are summarised below. Unfortunately, nothing
was found on Greek. If any additions seem necessary, please write back
to me.

Thanks,

Phaedra
PhD student
Universite de Montreal
Centre de recherche theophile alajouanine

On English:

Gan Wee Keong <ellganwk at leonis.nus.sg>

The British National Corpus word frequency lists generated by Adam
Kilgarriff. The lists are organised into several categories, so read
the README file before downloading.

To get the lists, connect by FTP to:

         ftp.itri.bton.ac.uk/pub/bnc
-----------------------------------------------------------------

 Richard Piepenbrock <celex at mpi.nl>

THE CELEX CD-ROM PRODUCED BY THE DUTCH CENTRE FOR LEXICAL INFORMATION
IN COLLABORATION WITH THE LINGUISTIC DATA CONSORTIUM

The Second Release of the CD-ROM, which contains the CELEX lexical
databases of English (version 2.5), Dutch (version 3.1) and German
(version 2.5), is now available for research purposes from the
Linguistic Data Consortium for $150.  For each language, the CD-ROM
contains detailed information on the orthography (variations in
spelling, hyphenation), the phonology (phonetic transcriptions,
variations in pronunciation, syllable structure, primary stress), the
morphology (derivational and compositional structure, inflectional
paradigms), the syntax (word class, word-class specific
subcategorisations, argument structures), and word frequency (summed
word and lemma counts, based on recent and representative text
corpora) of both wordforms and lemmas (English: 52446 lemmas, 160594
wordforms; German: 51728 lemmas, 365530 wordforms; Dutch: 124136
lemmas, 381292 wordforms).

------------------------------------------------------------------
 Lluís Padró <padro at lsi.upc.es>

I have available by FTP an English frequency list extracted from 1.1
million words of the WSJ.

  ftp anonymous to ftp-lsi.upc.es
  cd pub/lluisp
  get wsj.freq
------------------------------------------------------------------

 "M. Lynne Roecklein" <lynne at cc.gifu-u.ac.jp>

You may be wanting only very formal frequency lists, or you've
probably already checked out the following, but if not, there are
'lists of defining words' in the 1995 __Cambridge International
Dictionary of English__ (which claims frequency was one of the factors
in the assembly of that list but does not name its references) and the
1993 __Longman Language Activator__ (which refers to the Longman
Corpus Network data concerning frequency).  The Collins Cobuild people
must also have done frequency work on their corpus, which I understand
is rather extensive, to arrive at a defining vocabulary, but nothing
is said in their standard dictionary.  I realize that these
dictionaries are specialized in various ways, but their defining word
list would include only high frequency words.

-------------------------------------------------------------------

 "James L. Fidelholtz" <jfidel at cen.buap.mx>

	On English: There's always the granddaddy of all frequency
counts, Thorndike & Lorge, ca. 1943, probably still in print at
Columbia U. Teachers College Press (later impressions, of course).
The most accessible 'recent' version would probably be John Carroll's
(title may be slightly off) _The American Heritage word frequency
book_, published approx. 1980 by AH.  There's also some fairly recent
Scandinavian stuff (in the 80's) on English, but I forget now the
authors' names (on the basis, if I remember correctly, of the Brown
corpus).
	If you need more info, let me know, and I'll scour the stacks
at home.  Please let me know what you run across, as I'm always
interested in frequency studies.

-------------------------------------------------------------------

Ntirampeba Pascal <ntirampp at ERE.UMontreal.CA>

Another English word list is given by:
Johansson, S. & K. Hofland. 1989. Frequency Analysis of English
Vocabulary and Grammar. Oxford: Clarendon Press.

_____________________________________________________________
POLISH

 "James L. Fidelholtz" <jfidel at cen.buap.mx>

	With respect to Polish, there is a frequency count (or is it a
'backwards dictionary'?) for at least some poems of a Polish poet
whose name escapes me at the moment.  Ah, yes, there are frequency
counts of at least the press by, I believe, Topolin'ska (Maria?), but
super hard to come by -- check the OCLC and LC listings -- it would
have been published in the early or middle 70's, in several volumes.
I think I have some of them, but I'm not sure.

--------------------------------------------------------------------

Andrzej Lyda <kotlet at zeus.polsl.gliwice.pl>

A kind of frequency list was compiled by Tadeusz Piotrowski of the
Institute of English, Wroclaw University, for the purposes of a
Polish-English dictionary. He has also published: Contemporary
English: Word Lists. Part I-II. Wydawnictwo Uniwersytetu
Wroclawskiego. 1993. ISBN: 83-229-0940-3.

I would also contact PWN, Warsaw (National Scientific Publishers), which
has just published a CD-ROM edition of the Dictionary of Contemporary
Polish.

Andrzej Lyda
Institute of English
University of Silesia
Sosnowiec
Poland

------------------------------------------------------------------
Tilman Berger <tilman.berger at uni-tuebingen.de>

There is a frequency dictionary for Polish:

Slownik frekwencyjny polszczyzny wspolczesnej. Ed. Ida Kurcz et al.
Krakow: Polska Akademia Nauk, Instytut Jezyka Polskiego. Vol. 1, 1990.
Vol. 2, 1990.

Prof. Dr. Tilman Berger
Slavisches Seminar
Universitaet Tuebingen
Wilhelmstr. 50
D-72074 Tuebingen

Tel. 07071/29-76733 (Universitaet)
     07071/63365 (privat)

e-mail: tilman.berger at uni-tuebingen.de
________________________________________________________________
BULGARIAN

Kjetil Ra Hauge <K.R.Hauge at easteur-orient.uio.no>

For Bulgarian:

Nikolova, Cvetanka: CHestoten rechnik na bylgarskata razgovorna rech,
Sofija 1987

Todorova, Elena; Rada Panchovska: CHestoten rechnik na bylgarskata
publicistika (1944-1989), Sofija 1995

The latter is rarer than a Gutenberg Bible: the total printing is 25 (!)
copies.

_________________________________________________________________
TURKISH

Kemal Oflazer <ko at cs.bilkent.edu.tr>

We do not have frequency lists yet, but for general Turkish material
you can look at http://www.nlp.cs.bilkent.edu.tr.
We have some morphologically disambiguated corpora posted there; however,
they are quite short. We have some root word occurrence statistics for
those, but they may not be very meaningful.

Kemal Oflazer                   e-mail: ko at cs.bilkent.edu.tr
				http://www.cs.bilkent.edu.tr/~ko/ko.html
Bilkent University              tel: (90-312) 266-4133 (Sec)
Computer Engineering Department	              266-4000 x1258 (Off)
Bilkent, ANKARA, 06533 TURKIYE                240-1627  (Home)
				fax: (90-312) 266-4126		



---------------------------------------------------------------------------
LINGUIST List: Vol-8-363


