[Corpora-List] 4.7 billion token Dutch web corpus available (NLCOW14)
Roland Schäfer
roland.schaefer at fu-berlin.de
Sun Sep 7 01:05:34 UTC 2014
* Apologies for multiple postings *
As the culmination of more than two years of work on the next-generation
COW web corpora, a series of giga-token COWs in Dutch, English, French,
German, Spanish, and Swedish is now leaving the processing tool chain.
The Dutch corpus is the second to become available. It is a 4.7 billion
token sentence shuffle corpus derived from an unshuffled 6.9 billion
token corpus. Next in line are, in this order, English, German, Spanish,
and French.
Website: http://hpsg.fu-berlin.de/cow/
Download: http://hpsg.fu-berlin.de/cow/download/
Simple web interface: http://hpsg.fu-berlin.de/cow/colibri/
NLCOW14AX maintainer: Enrique Manjavacas <enrique.manjavacas at gmail.com>
COW initiative 2011-2014: Felix Bildhauer, Roland Schäfer
Best regards,
Enrique
Roland
===== SUMMARY OF CORPUS PROPERTIES =====
* crawled in 2012 and 2014 in the TLDs .nl and .be
* processed with texrex (http://texrex.sourceforge.net/) for:
  + markup stripping
  + UTF-8 transcoding and checking
  + entity conversion
  + heuristic repairs of broken encodings
  + document quality assessment using frequencies of short words:
    Schäfer et al. (2013) http://bit.ly/VSmK6M
  + boilerplate status classification for text blocks:
    Schäfer (2014, draft) http://bit.ly/VSmK6M
  + document de-duplication using classic w-shingling:
    Schäfer & Bildhauer (2012) http://bit.ly/1zJIqiT
* run-together sentences fixed with rofl (included in texrex)
* hard-coded hyphenation removed with HyDRA (included in texrex)
* tokenization with ucto and custom scripts
* POS tagging and lemmatization with TreeTagger
* meta data in the released sentence shuffle version:
  + document ID
  + document URL
  + server geolocation from GeoLite by MaxMind (http://www.maxmind.com)
  + document quality score
  + boilerplate score
  + crawldate
  + last-modified (if available)
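
For readers unfamiliar with the w-shingling step listed above, the
general idea can be sketched as follows. This is only a minimal
illustration of the textbook technique, not the texrex implementation:
the function names, the shingle width w=5, and the threshold of 0.8 are
hypothetical choices made for this example, and comparing all document
pairs with exact shingle sets would not scale to a web corpus, where a
hashed or sampled variant would be needed.

```python
def shingles(tokens, w=5):
    """Return the set of w-token shingles (contiguous token n-grams)."""
    if len(tokens) < w:
        # Documents shorter than w tokens yield one short shingle (or none).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b):
    """Jaccard coefficient of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, w=5, threshold=0.8):
    """Return pairs of document IDs whose resemblance meets the threshold.

    docs maps a document ID to whitespace-tokenized text. The values of
    w and threshold are illustrative assumptions, not NLCOW14 settings.
    """
    sets = {doc_id: shingles(text.split(), w) for doc_id, text in docs.items()}
    ids = sorted(sets)
    return [(x, y)
            for i, x in enumerate(ids)
            for y in ids[i + 1:]
            if resemblance(sets[x], sets[y]) >= threshold]
```

Calling near_duplicates() on a dictionary of ID-to-text entries returns
the ID pairs that share most of their shingles; one document of each
such pair would then be dropped.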