Corpora: Summary: German word lists

Stefan Thomas Gries StThGries at t-online.de
Wed Jul 12 12:19:08 UTC 2000


Dear colleagues,

Recently I posted a query about where to download German word lists. I would
like to thank the following people (in alphabetical order) for their kind
assistance:

Anna Braasch
Damon Allen Davison
Pius ten Hacken
Agnes Muehlmeyer-Mentzel
Noemi Preissner
Markus Schulze

In what follows I provide a list of the sites that were pointed out to me
with some additional comments:

http://www.linguistik.uni-erlangen.de/LAPTDA/laptda.html
These word lists were taken from seven domain-specific corpora (electronic
data processing, geography, law, medicine, sports, linguistics, economics)
and a representative German corpus (the LIMAS corpus). Each of these corpora
contains roughly 1,000,000 word forms. Available for download are:
o frequency lists of morphemes, allomorphs and word forms for each
  individual corpus;
o so-called "n-domain-lists" of morphemes, allomorphs and word forms, i.e.
  lists of the items that occurred in n of the domain-specific corpora
  mentioned above. E.g., the 2-domain-list of medicine and law contains all
  morphemes / allomorphs / word forms that occurred in both corpora,
  together with their respective frequency information.
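
For readers who want to reproduce something like the n-domain-lists
described above from their own material, the following is a minimal Python
sketch. It is only an illustration, not the procedure used by the Erlangen
site: it assumes that an n-domain-list is the set of items attested in every
one of n chosen corpora, with one frequency per corpus. The function name
n_domain_list and the toy tokens are invented for the example.

```python
from collections import Counter


def n_domain_list(corpora, domains):
    """Return the items attested in every listed domain corpus.

    corpora: dict mapping a domain name to a list of tokens (hypothetical
             input format; real corpora would be read from files).
    domains: the n domain names to intersect.
    The result maps each shared item to its frequency in each domain.
    """
    counts = {d: Counter(corpora[d]) for d in domains}
    # Keep only the items that occur in all of the chosen corpora.
    shared = set.intersection(*(set(c) for c in counts.values()))
    return {item: {d: counts[d][item] for d in domains}
            for item in sorted(shared)}


# Toy data, invented purely for illustration.
corpora = {
    "medicine": ["der", "Patient", "Befund", "der", "Gutachten"],
    "law": ["der", "Gutachten", "Klage", "Gutachten"],
}

# The 2-domain-list of medicine and law.
result = n_domain_list(corpora, ["medicine", "law"])
for item, freqs in result.items():
    print(item, freqs)
```

The same function handles larger n simply by passing more domain names;
items missing from any one of the chosen corpora are dropped.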

http://www.loria.fr/~bonhomme/sw/
A useful collection of lists for French, English and German (large word
lists and smaller stop lists)

http://services.canoo.com/MorphologyBrowser.html
http://www.unibas.ch/LIlab/projects/wordmanager/wordmanager.html
These two sites offer not only a list of word forms, but also a morphological
analysis module. In addition, word formation rules can be applied to
recognise newly coined compounds and derivations, which is a considerable
advantage for German.

Finally, Agnes Muehlmeyer-Mentzel was so kind as to let me have a
360,000-word list generated on the basis of the German weekly newspaper
Die Zeit (1986).


Apart from the above-mentioned sites directly concerned with word lists, I
was also directed to some sites with slightly different though related
contents:
http://www.kun.nl/celex/
http://www.ldc.upenn.edu/Catalog/LDC96L14.html
http://www.cis.uni-muenchen.de/projects/CISLEX.html


Once again, thanks to all contributors.

S t e f a n   T h .   G r i e s
----------------------------------------------------------------------------
B u e r o / O f f i c e :
Syddansk Universitet
Institut for Erhvervssproglig Informatik og Kommunikation
Grundtvigs Allé 150
6400 Sonderborg
Daenemark/Denmark


