[Corpora-List] 4.8 billion token Swedish web corpus available (SVCOW14)
Roland Schäfer
roland.schaefer at fu-berlin.de
Sat Aug 30 12:01:04 UTC 2014
* Apologies for multiple postings *
As the culmination of more than two years of work on the next-generation
COW web corpora, a series of giga-token COWs in Dutch, English, French,
German, Spanish, and Swedish is now leaving the processing tool chain. The
Swedish corpus is the first to become available. It is a 4.8 billion
token sentence-shuffled corpus derived from an unshuffled 8.6 billion
token corpus. Next in line are (in this order) Dutch, English, and German.
Website: http://hpsg.fu-berlin.de/cow/
Download: http://hpsg.fu-berlin.de/cow/download/
Web interface: http://hpsg.fu-berlin.de/cow/colibri/
SVCOW14AX maintainer: Roland Schäfer <mail at rolandschaefer.net>
COW initiative 2011-2014: Felix Bildhauer, Roland Schäfer
Best regards,
Roland
===== SUMMARY OF SVCOW14AX CORPUS PROPERTIES =====
* freely available under a restrictive academic license
* crawled in 2012 and 2014 in the TLDs .se and .fi
* vertical format with token/POS/lemma columns in minimal XML
* ready for encoding in versions of CWB which have UTF-8 support
* processed with texrex (http://texrex.sourceforge.net/) for:
  + markup stripping
  + UTF-8 transcoding and checking
  + entity conversion
  + heuristic repairs of broken encodings
  + document quality assessment using frequencies of short words:
    Schäfer et al. (2013) [http://bit.ly/VSmK6M]
  + boilerplate status classification for text blocks:
    Schäfer (2014, draft) [http://bit.ly/VSmK6M]
  + document de-duplication using classic w-shingling:
    Schäfer & Bildhauer (2012) [http://bit.ly/1zJIqiT]
* run-together sentences fixed with rofl (included in texrex)
* hard-coded hyphenation removed with HyDRA (included in texrex)
* tokenization with ucto and custom scripts
* POS tagging with HunPos
* lemmatization with custom tools
* meta data encoded in the released version:
  + document ID
  + document URL
  + server geolocation from GeoLite by MaxMind (http://www.maxmind.com)
  + document quality score
  + boilerplate score
  + crawl date
  + last-modified (if available)
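
For readers who want to work with the vertical format outside CWB, the
following is a minimal Python sketch of a reader, not an official SVCOW14AX
tool. It assumes the usual CWB input conventions: one token per line, with
tab-separated columns in the order token/POS/lemma, and each XML tag on a
line of its own. The tag names (<doc>, <s>), the attribute names, and the
sample sentence with its POS tags are invented for illustration and may
differ from the released files.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


def read_sentences(lines: Iterable[str],
                   sentence_tag: str = "s") -> Iterator[List[Token]]:
    """Yield one list of Token tuples per sentence element.

    Assumption: structural markup sits on its own line and contains no
    tab, while token lines have exactly three tab-separated fields.
    """
    close_tag = "</" + sentence_tag + ">"
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("<") and "\t" not in line:
            # Markup line: only the sentence close tag ends a sentence.
            if line == close_tag and current:
                yield current
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        current.append(Token(*fields))
    if current:
        yield current  # tokens left over after the last closing tag


# Hypothetical sample input; tag names, attributes and POS tags are made up.
sample = [
    '<doc id="hypothetical-id" url="http://example.se/">\n',
    "<s>\n",
    "Det\tPN\tdet\n",
    "regnar\tVB\tregna\n",
    ".\tMAD\t.\n",
    "</s>\n",
    "</doc>\n",
]
sentences = list(read_sentences(sample))
for sentence in sentences:
    print(" ".join(token.form for token in sentence))
```

For a real file, the same function can be given a file object opened with
encoding="utf-8", so that the corpus is streamed line by line rather than
loaded into memory.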