Corpora: Summary: German word lists
Stefan Thomas Gries
StThGries at t-online.de
Wed Jul 12 12:19:08 UTC 2000
Dear colleagues,
Recently I posted a query about where to download German word lists. I would
like to thank the following people (in alphabetical order) for their kind
assistance:
Anna Braasch
Damon Allen Davison
Pius ten Hacken
Agnes Muehlmeyer-Mentzel
Noemi Preissner
Markus Schulze
In what follows I provide a list of the sites that were pointed out to me
with some additional comments:
http://www.linguistik.uni-erlangen.de/LAPTDA/laptda.html
These word lists were taken from seven domain-specific corpora (electronic
data processing, geography, law, medicine, sports, linguistics, economics)
and a representative German corpus (LIMAS corpus). Each of these corpora
contains roughly 1,000,000 word forms. The following can be downloaded:
o Frequency lists of morphemes, allomorphs and word forms for the
individual corpora.
o So-called "n-domain-lists" of morphemes, allomorphs and word forms. An
n-domain-list is a list of items that occurred in n of the domain-specific
corpora mentioned above, e.g. the 2-domain-list of medicine and law contains
all morphemes / allomorphs / word forms that occurred in both corpora,
together with their respective frequency information.
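To make the idea of an n-domain-list concrete, here is a minimal sketch in Python. It is not the method used by the Erlangen site; the function name n_domain_list and the toy frequencies are invented for illustration, and it assumes each corpus is already available as a mapping from item to frequency.

```python
from collections import Counter

def n_domain_list(freq_lists):
    """Return the items found in every given corpus, with per-corpus frequencies.

    freq_lists maps a (hypothetical) corpus name to a frequency list,
    i.e. a mapping from item (morpheme, allomorph or word form) to count.
    """
    names = list(freq_lists)
    if not names:
        return {}
    # Start with the items of the first corpus and intersect with the rest.
    shared = set(freq_lists[names[0]])
    for name in names[1:]:
        shared &= set(freq_lists[name])
    # Keep the frequency of each shared item in each corpus.
    return {
        item: {name: freq_lists[name][item] for name in names}
        for item in sorted(shared)
    }

# Toy data, invented for this example only.
medicine = Counter({"der": 50000, "Patient": 1200, "Diagnose": 300})
law = Counter({"der": 48000, "Patient": 15, "Urteil": 900})

two_domain = n_domain_list({"medicine": medicine, "law": law})
for item, freqs in two_domain.items():
    print(item, freqs)
```

With the toy data, only "der" and "Patient" end up in the 2-domain-list, since "Diagnose" and "Urteil" each occur in just one of the two corpora.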
http://www.loria.fr/~bonhomme/sw/
A useful collection of lists for French, English and German (large word
lists and smaller stop lists)
http://services.canoo.com/MorphologyBrowser.html
http://www.unibas.ch/LIlab/projects/wordmanager/wordmanager.html
They offer not only a list of word forms, but also a morphological analysis
module. In addition, word formation rules can be applied to recognise newly
coined compounds and derivations, which is not a trivial advantage in
German.
Finally, Agnes Muehlmeyer was kind enough to let me have a 360,000-word
word list (generated on the basis of the German weekly newspaper Die Zeit,
1986).
Apart from the above-mentioned sites directly concerned with word lists, I
was also directed to some sites with slightly different though related
contents:
http://www.kun.nl/celex/
http://www.ldc.upenn.edu/Catalog/LDC96L14.html
http://www.cis.uni-muenchen.de/projects/CISLEX.html
Once again, thanks to all contributors.
S t e f a n T h . G r i e s
----------------------------------------------------------------------------
B u e r o / O f f i c e :
Syddansk Universitet
Institut for Erhvervssproglig Informatik og Kommunikation
Grundtvigs Allé 150
6400 Sonderborg
Daenemark/Denmark