[Corpora-List] SdeWaC available

Gertrud Faaß faassg at uni-hildesheim.de
Wed Mar 21 14:10:10 UTC 2012


Dear all,
On the basis of the deWaC corpus (Baroni/Kilgarriff (2006)), the NLP institute of the University of Stuttgart (Institut für Maschinelle Sprachverarbeitung, IMS) and the Institute for Information Science and Natural Language Processing at the University of Hildesheim (Institut für Informationswissenschaft und Sprachtechnologie, IwiSt) have created SdeWaC ("Stuttgart deWaC"). SdeWaC is built from a subset of the deWaC corpus and contains about 44 million sentences and 884 million tokens. The sentences were selected on the grounds of being syntactically parsable with a standard dependency parser for German. A separate document (file "web-address-list.txt") contains the URLs of the source texts.

Note that the corpus has been parsed at the University of Stuttgart with a state-of-the-art data-driven dependency parser (Bohnet (2010)); contact Wolfgang Seeker (seeker at ims.uni-stuttgart.de) for more information. The corpus is distributed as version 03; further cleaning is ongoing, so expect version 04 to follow.

The WaCky people were kind enough to make it available; see http://wacky.sslmit.unibo.it/ for more details.

-- 
Gertrud Faaß
University of Hildesheim
Department of Information Science and
email gertrud.faass at uni-hildesheim.de