6.1712, FYI: LSA Varia, Linguistic Data Consortium and CELEX
The Linguist List
linguist at tam2000.tamu.edu
Thu Dec 7 20:25:33 UTC 1995
---------------------------------------------------------------------------
LINGUIST List: Vol-6-1712. Thu Dec 7 1995. ISSN: 1068-4875. Lines: 246
Subject: 6.1712, FYI: LSA Varia, Linguistic Data Consortium and CELEX
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>
T. Daniel Seely: Eastern Michigan U. <dseely at emunix.emich.edu>
Associate Editor: Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
Ann Dizdar <dizdar at tam2000.tamu.edu>
Annemarie Valdez <avaldez at emunix.emich.edu>
Software development: John H. Remmers <remmers at emunix.emich.edu>
Editor for this issue: lveselin at emunix.emich.edu (Ljuba Veselinova)
---------------------------------Directory-----------------------------------
1)
Date: Thu, 07 Dec 1995 08:28:24 EST
From: ZZLSA at gallua.gallaudet.edu
Subject: LSA varia
2)
Date: Thu, 07 Dec 1995 20:31:53 +0100
From: Richard.Piepenbrock at mpi.nl (Richard Piepenbrock)
Subject: NEW RELEASE from the LINGUISTIC DATA CONSORTIUM and CELEX
---------------------------------Messages------------------------------------
1)
Date: Thu, 07 Dec 1995 08:28:24 EST
From: ZZLSA at gallua.gallaudet.edu
Subject: LSA varia
REMINDERS FROM LSA
--If you plan to attend the LSA Annual Meeting in San Diego,
please register for the meeting and make your hotel reservations.
For more information and/or forms, please contact the LSA
Secretariat: 202-835-1714; zzlsa at gallua.gallaudet.edu
--Program Heads and Department Chairs are reminded to complete and
return the questionnaire sent by the LSA Committee on Ethnic
Diversity in Linguistics. Replies were requested by 11 December
and should be sent asap.
------------------------------------------------------------------------
2)
Date: Thu, 07 Dec 1995 20:31:53 +0100
From: Richard.Piepenbrock at mpi.nl (Richard Piepenbrock)
Subject: NEW RELEASE from the LINGUISTIC DATA CONSORTIUM and CELEX
Announcing a
NEW RELEASE from the
LINGUISTIC DATA CONSORTIUM
and the
CENTRE FOR LEXICAL INFORMATION
This message announces the Second Release of the CELEX CD-ROM with
lexical data from the Dutch Centre for Lexical Information and the
Linguistic Data Consortium.
This CD-ROM contains an enhanced, expanded version of the German
lexical database (2.5), featuring approximately 1000 new lemma
entries, revised morphological parses, verb argument structures,
inflectional paradigm codes, and a corpus type lexicon. A complete
PostScript version of the German Linguistic Guide is also included, in
both European A4-format and American Letter format. For German, the
total number of lemmas included is now 51,728, while all their
inflected forms number 365,530.
Moreover, phonetic syllable frequencies have been added for (British)
English and Dutch. Apart from this, and the provision of frequency
information alongside every lexical feature, no changes have been made
to the Dutch and English lexicons.
Complete AWK-scripts are now provided to compute representations not
found in the (plain ASCII) lexical data files, corresponding to the
features described in the CELEX User Guide, which is included on the
CD as well.
For each language, i.e. English, German and Dutch, the CD-ROM contains
detailed information on the orthography (variations in spelling,
hyphenation), the phonology (phonetic transcriptions, variations in
pronunciation, syllable structure, primary stress), the morphology
(derivational and compositional structure, inflectional paradigms),
the syntax (word class, word-class specific subcategorisations,
argument structures), and word frequency (summed word and lemma
counts, based on recent and representative text corpora) of both
wordforms and lemmas. Unique identity numbers allow the linking of
information from different files with the aid of an efficient,
index-based C-program.
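The linking by identity number described above can be illustrated with a minimal sketch. It is not the index-based C program shipped on the CD; it is a plain dictionary join in Python. It assumes that each record starts with its identity number and that fields are separated by a backslash. The records and the helper names `load_by_id` and `link` are hypothetical.

```python
def load_by_id(lines, sep="\\"):
    """Map the leading identity number of each record to its remaining fields."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ident, *fields = line.split(sep)
        table[int(ident)] = fields
    return table


def link(left, right):
    """Join two ID-keyed tables, keeping only IDs present in both."""
    return {i: left[i] + right[i] for i in left if i in right}


# Hypothetical records: ID\headword in one file, ID\word class in the other.
orthography = ["1\\helfen", "2\\Helfer", "3\\Hund"]
syntax = ["1\\V", "3\\N"]

linked = link(load_by_id(orthography), load_by_id(syntax))
```

Entries without a partner in the other file (ID 2 here) are dropped; a real application might prefer to keep them with empty fields.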
Like its predecessor, the CD-ROM is mastered using the ISO 9660 data
format, with the Rock Ridge extensions, allowing it to be used in VMS,
MS-DOS, Macintosh and UNIX environments. As the new release does not
omit any data from the first edition, the current release will replace
the old one.
Institutions that have membership in the LDC during the 1995 or 1996
Membership Years will be able to receive CELEX, for research purposes
only, at no additional charge, in the same manner as all other text and
speech corpora published by the LDC.
Non-members can receive a copy of CELEX, for research purposes only, for
a fee of $150. If you would like to order a copy of this corpus,
please email your request to ldc at unagi.cis.upenn.edu, or fax it to
(215) 573-2175. If you need additional information before placing your
order, or would like to inquire about membership in the LDC, please
send email or call (215) 898-0464.
Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.cis.upenn.edu/~ldc. More information specific to CELEX can
be accessed via hyperlinks from this Home Page. Information is also
available via ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access,
please use "anonymous" as your login name, and give your email address
when asked for a password.
A brief overview of the revised German data on the CD is given below:
THE GERMAN DATABASE
When starting to use the German database, the user first has to choose
between three so-called `lexicon types':
- a lemma lexicon
- a wordform lexicon
- a corpus type lexicon
Each lexicon type uses a specific kind of entry. The CELEX lemma
lexicon is the one most similar to an ordinary dictionary, since every
entry in this lexicon represents a set of related inflected words. In
a lexicon, a lemma can be represented either by a headword (cf.
traditional dictionary entries) such as `helfen' (help) or `Hund'
(dog), or by a stem such as `helf' or `Hund'.
The wordform lexicon yields all possible inflected words: every entry
in the lexicon is an inflectional variant of the related headword or
stem. So, a wordform lexicon contains words like `helfe', `hilft',
`geholfen', `huelfe', `Hundes', `Hunde' and so on. A corpus type
lexicon, on the other hand, simply gives you an ordered list of all
alphanumeric strings found in the corpus with raw string counts,
undisambiguated for relations to either lemmas or wordforms.
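As a rough illustration of what a corpus type lexicon holds, the sketch below counts every maximal alphanumeric string in a text without relating it to a lemma or wordform. It is not the procedure used to build the CELEX data. The function name `corpus_types`, the sample sentence, and the choice of plain code-point ordering are assumptions made for the example.

```python
import re
from collections import Counter


def corpus_types(text):
    """Return (string, raw count) pairs for all alphanumeric strings, sorted."""
    # [^\W_]+ matches runs of letters and digits, excluding the underscore.
    counts = Counter(re.findall(r"[^\W_]+", text))
    return sorted(counts.items())


sample = "Der Hund hilft. Hunde helfen, der Hund hilft."
types = corpus_types(sample)
```

Note that `Der` and `der`, or `Hund` and `Hunde`, remain separate entries: nothing is folded together, which is what "undisambiguated" amounts to here.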
For all types of lexicons, the user may subsequently select any number
of columns -- from approximately 200 database columns -- combining
information on the orthography, phonology, morphology, syntax and
frequency of the entries.
LEXICAL DATA, GERMAN
The lexical data that can be selected for each entry in the different
German lexicon types can be divided into five categories: orthography,
phonology, morphology, syntax and frequency.
Orthography      - with or without diacritics
(spelling)       - with or without word division positions
                 - number of letters/syllables
Phonology        - phonetic transcriptions which use different notations
(pronunciation)    like SAMPA or CPA and include:
                     - syllable boundaries
                     - primary stress markers
                     - consonant-vowel patterns
                     - number of phonemes/syllables
Morphology       - Derivational/compositional:
(word structure)     - division into stems and affixes
                     - flat or hierarchical representations
                 - Inflectional:
                     - stems and their inflections
Syntax           - word class
(grammar)        - subcategorisations per word class
Frequency        - Mannheim frequency(*)
(*) These frequency data are based on the 6-million-word corpus
compiled by the Institut fuer Deutsche Sprache in Mannheim, Germany.
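Some of the phonological columns listed above can be derived from a transcription string. The following sketch assumes transcriptions of the form `"hEl-blau`, where `-` separates syllables and `"` is placed at the start of the syllable carrying primary stress. The function names are hypothetical, and the second test string in the usage note is invented to show a non-initial stress.

```python
def syllables(transcription):
    """Split a transcription on '-' and strip the primary stress marker."""
    return [syl.lstrip('"') for syl in transcription.split("-")]


def stressed_index(transcription):
    """Return the 0-based index of the syllable marked with '"', or None."""
    for i, syl in enumerate(transcription.split("-")):
        if syl.startswith('"'):
            return i
    return None


parts = syllables('"hEl-blau')      # ['hEl', 'blau']
count = len(parts)                  # number of syllables: 2
stress = stressed_index('"hEl-blau')  # 0, i.e. the first syllable
```

For an invented string such as `hYn-"dIn`, `stressed_index` would return 1.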
EXAMPLE DATA, GERMAN
An arbitrary query using a small German lemma lexicon (that is, one
with very few columns) might yield the following result:
Headword       Pronunciation     Morphology:                M:   Cl  Freq
                                 Structured Segmentation    Cl
-------------  ----------------  -------------------------  ---  --  ----
helfen         "hEl-f@n          (helf)                     V    V   1225
Helfer         "hEl-f@r          ((helf),(er))              Vx   N    134
hellaeugig     "hEl-Oy-gIx       ((hell),(Auge),(ig))       ANx  A      0
hellblau       "hEl-blau         ((hell),(blau))            AA   A     28
Hellseher      "hEl-ze:-@r       (((hell),(seh)),(er))      AVx  N     20
hellseherisch  "hEl-ze:-@-rIS    (((hell),(seh)),(erisch))  AVx  A      0
hellwach       "hEl-vax          ((hell),(((wach),(e))))    AVx  A     13
Helm           "hElm             (Helm)                     N    N     22
Hund           "hUnt             (Hund)                     N    N    364
Huendchen      "hYnt-x@n         ((Hund),(chen))            Nx   N      7
hundekalt      "hUn-d@-kalt      ((Hund),(e),(kalt))        NxA  A      0
hundemuede     "hUn-d@-my:-d@    ((Hund),(e),(muede))       NxA  A      3
Hundeschnauze  "hUn-d@-Snau-ts@  ((Hund),(e),(Schnauze))    NxN  N      1
Hundesteuer    "hUn-d@-StOy-@r   ((Hund),(e),(Steuer))      NxN  N      6
Hundewetter    "hUn-d@-vE-t@r    ((Hund),(e),(Wetter))      NxN  N      0
Huendin        "hYn-dIn          ((Hund),(in))              Nx   N      7
huendisch      "hYn-dIS          ((Hund),(isch))            Nx   A      2
Huene          "hy:-n@           (Huene)                    N    N     13
huenenhaft     "hy:-n@n-haft     ((Huene),(n),(haft))       Nxx  A      4
Hunger         "hU-N@r           (Hunger)                   N    N    102
Hungerkur      "hU-N@r-ku:r      ((Hunger),(Kur))           NN   N      5
Hungerlohn     "hU-N@r-lo:n      ((Hunger),(Lohn))          NN   N      6
hungern        "hU-N@rn          ((Hunger))                 N    V     33
Hungersnot     "hU-N@rs-no:t     ((Hunger),(s),(Not))       NxN  N     23
Hungerstreik   "hU-N@r-Straik    ((Hunger),((streik)))      NV   N     14
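The bracketed entries in the structured segmentation column are hierarchical, and a flat list of morphemes can be read off them. The sketch below is one possible way to do this in Python, not code from the CD. It assumes well-formed input in which every unit is wrapped in parentheses and sibling units are separated by commas, as in the table; the names `parse` and `flatten` are hypothetical.

```python
def parse(seg):
    """Parse a bracketed segmentation into nested lists of morpheme strings."""
    pos = 0

    def node():
        nonlocal pos
        if seg[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1
        if seg[pos] == "(":
            # Complex unit: one or more comma-separated sub-units.
            result = [node()]
            while seg[pos] == ",":
                pos += 1
                result.append(node())
        else:
            # Simple unit: a morpheme running up to the next ')'.
            end = seg.index(")", pos)
            result = seg[pos:end]
            pos = end
        if seg[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1
        return result

    tree = node()
    if pos != len(seg):
        raise ValueError("trailing characters after segmentation")
    return tree


def flatten(tree):
    """Reduce a hierarchical segmentation to a flat list of morphemes."""
    if isinstance(tree, str):
        return [tree]
    return [m for child in tree for m in flatten(child)]


tree = parse("(((hell),(seh)),(er))")   # [['hell', 'seh'], 'er']
flat = flatten(tree)                     # ['hell', 'seh', 'er']
```

The nested result keeps the information that `hell` and `seh` form a unit before `er` attaches, while the flat list discards it.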
Richard Piepenbrock
CELEX Project Manager
-- C E L E X --
-- The Centre for Lexical Information --
Max Planck Institute for Psycholinguistics
Wundtlaan 1
6525 XD NIJMEGEN
The Netherlands
Tel: (+31) (0)24 - 3615797
Fax: (+31) (0)24 - 3521213
E-mail: celex at mpi.nl
WWW-page: http://www.kun.nl/celex/
------------------------------------------------------------------------
LINGUIST List: Vol-6-1712.