26.2114, FYI: COW: Free, Large Web Corpora in European Languages
The LINGUIST List via LINGUIST
linguist at listserv.linguistlist.org
Tue Apr 21 15:17:36 UTC 2015
LINGUIST List: Vol-26-2114. Tue Apr 21 2015. ISSN: 1069 - 4875.
Subject: 26.2114, FYI: COW: Free, Large Web Corpora in European Languages
Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org
Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================
Date: Tue, 21 Apr 2015 11:15:29
From: Roland Schäfer [roland.schaefer at fu-berlin.de]
Subject: COW: Free, Large Web Corpora in European Languages
We would like to introduce the Corpora from the Web (COW) family of corpora, created in an ongoing project at Freie Universität Berlin, to the wider linguistic community. The corpora are available for Dutch, English, German, Spanish, and Swedish. A COW corpus of international French will be released in Q2/2015.
Interface for querying and download (after a quick FREE registration): https://webcorpora.org/
More information: http://corporafromtheweb.org/
Home: http://hpsg.fu-berlin.de/cow/
The corpora contain material collected from the web between 2011 and 2014. Each is several billion tokens in size (GT = giga-tokens):
- Dutch 4.7 GT
- English 9.6 GT
- German 11.7 GT
- Spanish 3.7 GT
- Swedish 4.8 GT
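For a sense of overall scale, the sizes listed above can be tallied. This is only a small sketch that uses the figures given in the list; the variable names are ours and carry no special meaning.

```python
# Corpus sizes in giga-tokens (GT), copied from the list above.
sizes_gt = {
    "Dutch": 4.7,
    "English": 9.6,
    "German": 11.7,
    "Spanish": 3.7,
    "Swedish": 4.8,
}

# Round to one decimal place to avoid floating-point noise in the sum.
total_gt = round(sum(sizes_gt.values()), 1)

# The language with the largest corpus.
largest = max(sizes_gt, key=sizes_gt.get)

print(f"Total: {total_gt} GT; largest: {largest} ({sizes_gt[largest]} GT)")
```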
Since they are web-derived, the corpora contain standard language as well as language typical of computer-mediated communication. Because certain kinds of web data are noisy, we have invested several years in making them as usable as any traditionally compiled corpus. Moreover, the corpora were not simply collected from single top-level domains (like .de or .uk) and therefore contain international variants of the respective languages, with URLs and metadata helping to identify specific regional variants.
All COW corpora have been automatically annotated with metadata such as download date, country and city of origin (using IP geolocation databases), overall document text quality, and paragraph text quality. They all have part-of-speech and lemma annotation. Some of them additionally have dependency relations (English, Dutch, German and Swedish planned), named entity annotations (German), and morphological analyses of inflected forms (German, Spanish). Furthermore, the German corpus is annotated with metrics that help to find or filter documents written in a predominantly spontaneous register (usually encountered in forums and blog discussions).
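To illustrate how document-level metadata of this kind might be used to select a regional subcorpus, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `DocMeta` class, its field names, the numeric quality scale (higher taken to mean better), and the example URLs are hypothetical and do not reflect the actual attribute names or value ranges in the COW corpora.

```python
from dataclasses import dataclass


@dataclass
class DocMeta:
    """Hypothetical per-document metadata record (not the real COW schema)."""
    url: str
    country: str    # assumed: country code derived from IP geolocation
    quality: float  # assumed: document text quality score, higher = better


def select_docs(docs, country, min_quality):
    """Keep documents from one country whose quality meets a threshold."""
    return [d for d in docs if d.country == country and d.quality >= min_quality]


# Invented example records, for demonstration only.
docs = [
    DocMeta("http://example.org/a", "AT", 0.9),
    DocMeta("http://example.org/b", "DE", 0.8),
    DocMeta("http://example.org/c", "AT", 0.3),
]

austrian = select_docs(docs, country="AT", min_quality=0.5)
print([d.url for d in austrian])
```

Combining a country filter with a quality threshold in this way is one plausible route to isolating a regional variant while leaving out the noisiest documents.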
We hope that many of you will find our corpora a valid source of data in morphology, syntax, graphemics, lexicography, and many other fields.
Felix Bildhauer, Roland Schäfer
Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
----------------------------------------------------------
LINGUIST List: Vol-26-2114
----------------------------------------------------------