26.2114, FYI: COW: Free, Large Web Corpora in European Languages

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Tue Apr 21 15:17:36 UTC 2015

LINGUIST List: Vol-26-2114. Tue Apr 21 2015. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Ashley Parker <ashley at linguistlist.org>

Date: Tue, 21 Apr 2015 11:15:29
From: Roland Schäfer [roland.schaefer at fu-berlin.de]
Subject: COW: Free, Large Web Corpora in European Languages

We would like to introduce the Corpora from the Web (COW) family of corpora, created in an ongoing project at Freie Universität Berlin, to a larger linguistic community. They are available in Dutch, English, German, Spanish, and Swedish. A COW corpus of international French will be released in Q2/2015.

Interface for querying and download (after a quick FREE registration): https://webcorpora.org/

More information: http://corporafromtheweb.org/

Home: http://hpsg.fu-berlin.de/cow/

The corpora contain material collected from the web between 2011 and 2014. Each corpus contains several billion tokens (sizes are given in giga-tokens, GT):

- Dutch 4.7 GT
- English 9.6 GT
- German 11.7 GT
- Spanish 3.7 GT
- Swedish 4.8 GT
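
For readers who want the combined size, the figures above can be added up with a few lines of Python. This is only an illustrative sketch: the sizes are copied from the list, the names SIZES_GT, total_gt and to_tokens are made up for this example, and 1 GT is assumed to mean 10**9 tokens.

```python
# Corpus sizes in giga-tokens (GT), copied from the list above.
SIZES_GT = {
    "Dutch": 4.7,
    "English": 9.6,
    "German": 11.7,
    "Spanish": 3.7,
    "Swedish": 4.8,
}


def total_gt(sizes):
    """Return the combined size in giga-tokens, rounded to one decimal."""
    return round(sum(sizes.values()), 1)


def to_tokens(gt):
    """Convert giga-tokens to a raw token count (assumes 1 GT = 10**9 tokens)."""
    return int(round(gt * 1_000_000_000))


if __name__ == "__main__":
    # Print the corpora from largest to smallest, then the combined size.
    for language, gt in sorted(SIZES_GT.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{language:<8} {gt:>5.1f} GT  ~{to_tokens(gt):,} tokens")
    print(f"Total    {total_gt(SIZES_GT):>5.1f} GT")
```

Under these assumptions the five released corpora add up to about 34.5 GT, with German the largest single corpus.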

Because they are derived from the web, the corpora contain standard language as well as language typical of computer-mediated communication. Because certain kinds of web data are noisy, we have spent several years making the corpora as usable as any traditionally compiled corpus. In addition, the corpora were not simply collected from single top-level domains (such as .de or .uk) and therefore contain international variants of the respective languages; URLs and metadata help to identify specific regional variants.

All COW corpora have been automatically annotated with metadata such as download date, country and city of origin (using IP geolocation databases), overall document text quality, and paragraph text quality. They all have part-of-speech and lemma annotation. Some of them are additionally annotated with dependency relations (English, Dutch, German and Swedish planned), named entities (German), and morphological analyses of inflected forms (German, Spanish). Furthermore, the German corpus is annotated with metrics that help to find or filter documents written in a predominantly spontaneous register (usually encountered in forums and blog discussions).

We hope that many of you will find our corpora a valid source of data in morphology, syntax, graphemics, lexicography, and many other fields.

Felix Bildhauer, Roland Schäfer

Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics
