Corpora: German corpus

LDC Office ldc at unagi.cis.upenn.edu
Wed Mar 29 19:24:26 UTC 2000


Dear Christina,

The Linguistic Data Consortium (LDC) offers a variety of German
corpora.  We have telephone speech, transcripts, lexicons, and
newswire text.

We have two telephone speech collections, CallFriend and CallHome.
The CallFriend collection consists of 60 unscripted telephone
conversations lasting between 5 and 30 minutes.  The CallHome
collection consists of 100 telephone conversations lasting up to 30
minutes each.  Transcripts of the CallHome calls are available, as
is a lexicon.

The CallHome German lexicon consists of 318,807 words and contains
tab-separated information fields with orthographic, morphological,
phonological, stress, source, and frequency information for each
word.  315,503 words from the CallHome German lexicon are adapted
from the CELEX German lexicon produced by The Centre for Lexical
Information, which is also distributed through LDC.  CELEX contains
information on orthography, phonology, morphology, syntax, and word
frequency.
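
As a rough illustration of how a tab-separated lexicon of this kind
could be read, here is a minimal Python sketch.  It assumes one entry
per line with the six fields in the order listed above; the actual
column order and encoding should be checked against the corpus
documentation.  The names LexiconEntry and read_lexicon and the sample
rows are hypothetical and are not taken from the lexicon itself.

```python
import csv
import io
from typing import Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    # Assumed field order; verify against the lexicon's own documentation.
    orthography: str
    morphology: str
    phonology: str
    stress: str
    source: str
    frequency: str


def read_lexicon(stream) -> Iterator[LexiconEntry]:
    """Yield one LexiconEntry per well-formed tab-separated line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and lines without exactly six fields.
        if len(row) != len(LexiconEntry._fields):
            continue
        yield LexiconEntry(*row)


# Invented sample rows, for illustration only (not real lexicon data).
SAMPLE = (
    "Haus\tHaus\thaUs\t1\tCELEX\t120\n"
    "Häuser\tHaus+er\thOYz6\t10\tCELEX\t45\n"
)

entries = list(read_lexicon(io.StringIO(SAMPLE)))
for entry in entries:
    print(entry.orthography, entry.source, entry.frequency)
```

With a real file, the same function could be applied to a handle
opened with the encoding stated in the corpus documentation.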

We have two corpora which contain German newstext.  ECI
Multilingual Text consists of roughly 92 million words from 27
languages.  It contains roughly 36 million words in German from
various news sources.  The European Language Newspaper Text
collection includes roughly 100 million words of French, 90 million
words of German and 15 million words of Portuguese.  Our newstext
collections are marked using SGML to identify article boundaries.

For more information on these corpora please visit our Catalog
search page at

http://morph.ldc.upenn.edu/Catalog/search.html

and select the language and/or corpus type in which you are
interested.  Please feel free to contact me with any questions.

Best regards,

Shannon Sears
Manager, Intellectual Property Rights and Membership
----------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 898-0464
3615 Market Street                  Fax:   (215) 573-2175
Suite 200                           email: ssears at ldc.upenn.edu
Philadelphia, PA 19104-2608         www: http://www.ldc.upenn.edu




Christina Rosén wrote:

> Hello,
>
> I am doing research on second language acquisition. Could someone tell me
> if there is an adequate German corpus available somewhere. Most corpora seem
> to be English!
> I would be very grateful for help. Thanks!
>
> Best regards
> Christina Rosén
> Växjö university
>
> ----------------------------------
> Christina Rosén
> Inst. för humaniora
> Växjö universitet
> 351 95 Växjö
>
> Phone  +46 470 70 88 55
> Fax    +46 470 75 18 88
> Phone/Fax +46 470 124 27 (home)


