6.1712, FYI: LSA Varia, Linguistic Data Consortium and CELEX

Thu Dec 7 20:25:33 UTC 1995

---------------------------------------------------------------------------
LINGUIST List:  Vol-6-1712. Thu Dec 7 1995. ISSN: 1068-4875. Lines:  246

Subject: 6.1712, FYI: LSA Varia, Linguistic Data Consortium and CELEX

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>
            T. Daniel Seely: Eastern Michigan U. <dseely at emunix.emich.edu>

Associate Editor:  Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
                   Ann Dizdar <dizdar at tam2000.tamu.edu>
                   Annemarie Valdez <avaldez at emunix.emich.edu>

Software development: John H. Remmers <remmers at emunix.emich.edu>

Editor for this issue: lveselin at emunix.emich.edu (Ljuba Veselinova)

---------------------------------Directory-----------------------------------
1)
Date:  Thu, 07 Dec 1995 08:28:24 EST
From:  ZZLSA at gallua.gallaudet.edu
Subject:  LSA varia

2)
Date:  Thu, 07 Dec 1995 20:31:53 +0100
From:  Richard.Piepenbrock at mpi.nl (Richard Piepenbrock)
Subject:  NEW RELEASE from the LINGUISTIC DATA CONSORTIUM and CELEX

---------------------------------Messages------------------------------------
1)
Date:  Thu, 07 Dec 1995 08:28:24 EST
From:  ZZLSA at gallua.gallaudet.edu
Subject:  LSA varia

REMINDERS FROM LSA

 --If you plan to attend the LSA Annual Meeting in San Diego,
please register for the meeting and make your hotel reservations.
For more information and/or forms, please contact the LSA
Secretariat: 202-835-1714; zzlsa at gallua.gallaudet.edu

 --Program Heads and Department Chairs are reminded to complete and
return the questionnaire sent by the LSA Committee on Ethnic
Diversity in Linguistics.  Replies were requested by 11 December
and should be sent as soon as possible.
------------------------------------------------------------------------
2)
Date:  Thu, 07 Dec 1995 20:31:53 +0100
From:  Richard.Piepenbrock at mpi.nl (Richard Piepenbrock)
Subject:  NEW RELEASE from the LINGUISTIC DATA CONSORTIUM and CELEX

                           Announcing a
                        NEW RELEASE from the
                     LINGUISTIC DATA CONSORTIUM
                             and the
                   CENTRE FOR LEXICAL INFORMATION

This message announces the Second Release of the CELEX CD-ROM with
lexical data from the Dutch Centre for Lexical Information and the
Linguistic Data Consortium.

This CD-ROM contains an enhanced, expanded version of the German
lexical database (2.5), featuring approximately 1000 new lemma
entries, revised morphological parses, verb argument structures,
inflectional paradigm codes, and a corpus type lexicon. A complete
PostScript version of the German Linguistic Guide is also included, in
both European A4-format and American Letter format.  For German, the
total number of lemmas included is now 51,728, while all their
inflected forms number 365,530.

Moreover, phonetic syllable  frequencies have been added for (British)
English and Dutch.  Apart  from this, and  the provision  of frequency
information alongside every lexical feature, no changes have been made
to the Dutch and English lexicons.

Complete  AWK-scripts are now  provided to compute representations not
found  in the (plain  ASCII) lexical data  files, corresponding to the
features described  in the CELEX User  Guide, which is included on the
CD as well.

For each language, i.e. English, German and Dutch, the CD-ROM contains
detailed information on the    orthography (variations  in   spelling,
hyphenation),  the phonology (phonetic  transcriptions,  variations in
pronunciation,  syllable  structure,  primary  stress), the morphology
(derivational and    compositional structure, inflectional paradigms),
the  syntax   (word  class,  word-class  specific  subcategorisations,
argument structures), and   word  frequency  (summed word   and  lemma
counts, based  on  recent and  representative  text corpora)  of  both
wordforms  and lemmas. Unique identity  numbers  allow the linking  of
information from different  files    with the  aid  of an   efficient,
index-based C-program.
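
The linking step described above can be pictured as a keyed join on
the identity number. What follows is a minimal Python sketch, not the
C program shipped on the CD. It assumes one entry per line, the
identity number as the first field, and a backslash as the field
separator; the file contents, the identity numbers and the function
names are all hypothetical.

```python
import io

SEP = "\\"  # assumed field separator; adjust to the actual file layout


def build_index(lines, sep=SEP):
    """Map each identity number (assumed first field) to its remaining fields."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, *fields = line.split(sep)
        index[int(ident)] = fields
    return index


def link(primary, secondary):
    """Join two indexes on identity number, keeping ids present in both."""
    return {
        ident: primary[ident] + secondary[ident]
        for ident in primary
        if ident in secondary
    }


# Hypothetical file contents; the identity numbers are invented.
orthography = io.StringIO("1\\helfen\n2\\Hund\n3\\Helm\n")
frequency = io.StringIO("1\\1225\n2\\364\n")

linked = link(build_index(orthography), build_index(frequency))
for ident, fields in sorted(linked.items()):
    print(ident, fields)
```

Building the index once per file keeps each lookup cheap, which is the
general idea behind an index-based join; entries missing from either
file are simply left out of the result.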

Like its predecessor,  the CD-ROM is mastered using  the ISO 9660 data
format, with the Rock Ridge extensions, allowing it to be used in VMS,
MS-DOS,  Macintosh and UNIX environments.  As the new release does not
omit any data from the first edition, the current release will replace
the old one.

Institutions  that have membership in the  LDC during the 1995 or 1996
Membership Years will  be able to  receive CELEX for research purposes
only at no additional charge, in the same manner as all other text and
speech corpora published by the LDC.

Non-members can receive a copy of CELEX for research purposes only for
a fee  of $150. If you   would like to order   a copy of  this corpus,
please  email your request  to  ldc at unagi.cis.upenn.edu, or fax  it to
(215) 573-2175. If you need additional information before placing your
order, or would like  to inquire about  membership in the LDC,  please
send email or call (215) 898-0464.

Further information  about the LDC  and its  available corpora can  be
accessed   on the Linguistic   Data Consortium  WWW Home   Page at URL
http://www.cis.upenn.edu/~ldc. More  information specific to CELEX can
be accessed via  hyperlinks from this  Home Page.  Information is also
available via ftp at  ftp.cis.upenn.edu under pub/ldc; for ftp access,
please use "anonymous" as your login name, and give your email address
when asked for password.

A brief overview of the revised German data on the CD is given below:

THE GERMAN DATABASE

When starting to use the German database, the user first has to choose
between three so-called `lexicon types':

   - a lemma lexicon
   - a wordform lexicon
   - a corpus type lexicon

Each lexicon type uses a specific kind of entry. The CELEX lemma
lexicon is the one most similar to an ordinary dictionary, since every
entry in this lexicon represents a set of related inflected words. In
a lexicon, a lemma can be represented by a headword (cf. traditional
dictionary entries) such as `helfen' (help) or `Hund' (dog), or by a
stem such as `helf' or `Hund'. The wordform lexicon yields all
possible inflected words: every entry in the lexicon is an
inflectional variant of the related headword or stem. So a wordform
lexicon contains words like `helfe', `hilft', `geholfen', `huelfe',
`Hundes', `Hunde' and so on. A corpus type lexicon, on the other hand,
simply gives an ordered list of all alphanumeric strings found in the
corpus with raw string counts, not disambiguated with respect to
lemmas or wordforms.
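
To make the idea of a corpus type lexicon concrete, here is a small
Python sketch that builds an ordered list of raw string counts from a
piece of text. It only illustrates the concept. The tokenisation rule
(runs of ASCII letters and digits), the case-insensitive alphabetical
ordering and the sample sentence are assumptions, not a description of
how the CELEX list was produced.

```python
import re
from collections import Counter


def corpus_type_lexicon(text):
    """Return (string, raw count) pairs for every alphanumeric string in text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    counts = Counter(tokens)
    # Assumed ordering: alphabetical, ignoring case, with the raw string
    # as a tie-breaker so the result is deterministic.
    return sorted(counts.items(), key=lambda item: (item[0].lower(), item[0]))


sample = "Der Hund hilft. Die Hunde helfen dem Hund."
for string, count in corpus_type_lexicon(sample):
    print(f"{string}\t{count}")
```

Note that `Hund' and `Hunde' are counted separately and nothing links
either string to a lemma, which is what distinguishes this lexicon
type from the other two.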

For all types of lexicons, the user may subsequently select any number
of columns -- from   approximately 200 database columns  --  combining
information on  the  orthography,  phonology,  morphology,  syntax and
frequency of the entries.

LEXICAL DATA, GERMAN

The lexical data that can be selected  for each entry in the different
German lexicon types can be divided into five categories: orthography,
phonology, morphology, syntax and frequency.

Orthography      - with or without diacritics
(spelling)       - with or without word division positions
                 - number of letters/syllables

Phonology        - phonetic transcriptions which use different notations
(pronunciation)    like SAMPA or CPA and include:
                      - syllable boundaries
                      - primary stress markers
                      - consonant-vowel patterns
                      - number of phonemes/syllables

Morphology       - Derivational/compositional:
(word structure)      - division into stems and affixes
                      - flat or hierarchical representations
                 - Inflectional:
                      - stems and their inflections

Syntax           - word class
(grammar)        - subcategorisations per word class

Frequency        - Mannheim frequency(*)

(*) These  frequency data  are  based on the    6 million word  corpus
compiled by the Institut fuer Deutsche Sprache in Mannheim, Germany.

EXAMPLE DATA, GERMAN

An arbitrary query  using a small German  lemma lexicon (that  is, one
with very few columns) might yield the following result:

Headword      Pronunciation     Morphology:               M:  Cl Freq
                                Structured Segmentation   Cl
-----------   ----------------  ------------------------  --- -- ----
helfen        "hEl-f@n          (helf)                    V   V  1225
Helfer        "hEl-f@r          ((helf),(er))             Vx  N   134
hellaeugig    "hEl-Oy-gIx       ((hell),(Auge),(ig))      ANx A     0
hellblau      "hEl-blau         ((hell),(blau))           AA  A    28
Hellseher     "hEl-ze:-@r       (((hell),(seh)),(er))     AVx N    20
hellseherisch "hEl-ze:-@-rIS    (((hell),(seh)),(erisch)) AVx A     0
hellwach      "hEl-vax          ((hell),(((wach),(e))))   AVx A    13
Helm          "hElm             (Helm)                    N   N    22
Hund          "hUnt             (Hund)                    N   N   364
Huendchen     "hYnt-x@n         ((Hund),(chen))           Nx  N     7
hundekalt     "hUn-d@-kalt      ((Hund),(e),(kalt))       NxA A     0
hundemuede    "hUn-d@-my:-d@    ((Hund),(e),(muede))      NxA A     3
Hundeschnauze "hUn-d@-Snau-ts@  ((Hund),(e),(Schnauze))   NxN N     1
Hundesteuer   "hUn-d@-StOy-@r   ((Hund),(e),(Steuer))     NxN N     6
Hundewetter   "hUn-d@-vE-t@r    ((Hund),(e),(Wetter))     NxN N     0
Huendin       "hYn-dIn          ((Hund),(in))             Nx  N     7
huendisch     "hYn-dIS          ((Hund),(isch))           Nx  A     2
Huene         "hy:-n@           (Huene)                   N   N    13
huenenhaft    "hy:-n@n-haft     ((Huene),(n),(haft))      Nxx A     4
Hunger        "hU-N@r           (Hunger)                  N   N   102
Hungerkur     "hU-N@r-ku:r      ((Hunger),(Kur))          NN  N     5
Hungerlohn    "hU-N@r-lo:n      ((Hunger),(Lohn))         NN  N     6
hungern       "hU-N@rn          ((Hunger))                N   V    33
Hungersnot    "hU-N@rs-no:t     ((Hunger),(s),(Not))      NxN N    23
Hungerstreik  "hU-N@r-Straik    ((Hunger),((streik)))     NV  N    14
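
The bracketed segmentations and the transcriptions in the table above
lend themselves to simple processing. The Python sketch below is one
possible reading of the notation as it appears in the example rows,
not code from the CD. It assumes that every morphological constituent
is wrapped in parentheses with commas between siblings, that `-' marks
a syllable boundary, and that `"' precedes the syllable carrying
primary stress. All function names are hypothetical.

```python
def parse_segmentation(s):
    """Parse a bracketed segmentation into nested lists; leaves are strings."""
    pos = 0

    def node():
        nonlocal pos
        if s[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1
        if s[pos] == "(":
            # Complex constituent: one or more comma-separated children.
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            result = children
        else:
            # Simple constituent: plain text up to the closing parenthesis.
            end = s.index(")", pos)
            result = s[pos:end]
            pos = end
        if s[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1
        return result

    tree = node()
    if pos != len(s):
        raise ValueError("trailing characters after segmentation")
    return tree


def flatten(tree):
    """Reduce a hierarchical segmentation to a flat list of morphemes."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in flatten(child)]


def syllables(transcription):
    """Split at '-' and report the index of the syllable with the stress mark."""
    parts = transcription.split("-")
    stressed = next((i for i, p in enumerate(parts) if p.startswith('"')), None)
    return [p.lstrip('"') for p in parts], stressed


for seg in ["(helf)", "(((hell),(seh)),(er))", "((Hund),(e),(kalt))"]:
    tree = parse_segmentation(seg)
    print(seg, "->", tree, "->", flatten(tree))
print(syllables('"hEl-blau'))
```

The nested lists correspond to the hierarchical representation and the
output of flatten() to the flat one; the length of the syllable list
gives the syllable count for an entry.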

Richard Piepenbrock
CELEX Project Manager

                                                     C
           --   C E L E X   --
 -- The Centre for Lexical Information --                   C
                                               C        C      C
                                                    C
Max Planck Institute for Psycholinguistics                    C    CCCCCC
Wundtlaan 1                                              C     CCCCCCCCCCCCC
6525 XD  NIJMEGEN                     C           C    C     CCCCCCCCCCCCCCCC
The Netherlands                                          CCCCCCCCCC        CC
                                                   C    CCCCCCCC
Tel: (+31) (0)24 - 3615797                             CCCCCCCC
Fax: (+31) (0)24 - 3521213                             CCCCCCCC
                                                      CCCCCCCC
                                                      CCCCCCCC
E-mail:    celex at mpi.nl                               CCCCCCCC
                                                       CCCCCCCC
WWW-page:  http://www.kun.nl/celex/                    CCCCCCCC
                                                        CCCCCCCCC
                                                         CCCCCCCCCCC

------------------------------------------------------------------------