[Corpora-List] Resource release: Wikipedia corpora in Catalan, Spanish, and English
Gemma Boleda
gboleda at lsi.upc.edu
Mon Nov 15 11:26:44 UTC 2010
Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia.
The Wikicorpus contains portions of the Catalan, Spanish, and English Wikipedias
based on a 2006 dump. The corpora have been automatically tagged with lemma and
part-of-speech information using the open-source library FreeLing. They have also
been annotated with WordNet senses using the state-of-the-art Word Sense
Disambiguation algorithm UKB. In their current version, the corpora have the
following sizes:
* Catalan: around 50 million words
* Spanish: around 120 million words
* English: around 600 million words
We provide access to the corpora in their raw text and tagged versions, under the
same license as Wikipedia itself. To our knowledge, these are the largest Catalan
and Spanish corpora freely available for download. We also provide the open-source,
Java-based parser for Wikipedia pages that was developed to build the corpus.
For more information and downloads, please visit the project's page:
http://www.lsi.upc.edu/~nlp/wikicorpus
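
As a rough illustration of how the tagged version might be consumed, here is a
minimal Python sketch that counts lemmas. It assumes a hypothetical layout of one
token per line with four whitespace-separated fields (word, lemma, part-of-speech
tag, sense identifier) and document markers on lines starting with "<". That
layout, the names Token, read_tokens and lemma_counts, and the sample data
(including the placeholder sense labels) are assumptions made for this example and
are not taken from the release; please check the project's page for the actual
format.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One annotated token in the ASSUMED four-field layout."""
    word: str
    lemma: str
    pos: str
    sense: str


def read_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines of the form 'word lemma pos sense' (assumed)."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and markup-like lines such as <doc ...> or </doc>.
        if not stripped or stripped.startswith("<"):
            continue
        fields = stripped.split()
        # Skip anything that does not have exactly the four assumed fields.
        if len(fields) != 4:
            continue
        yield Token(*fields)


def lemma_counts(lines: Iterable[str]) -> Counter:
    """Count how often each lemma occurs in the given lines."""
    return Counter(tok.lemma for tok in read_tokens(lines))


# Made-up sample data; the sense labels are placeholders, not real WordNet IDs.
sample = [
    '<doc id="1" title="Example">',
    "The the DT 0",
    "cats cat NNS sense-a",
    "sleep sleep VBP sense-b",
    "",
    "</doc>",
]

if __name__ == "__main__":
    for lemma, count in lemma_counts(sample).most_common():
        print(lemma, count)
```

For a real file, the same functions could be fed an open file handle instead of
the sample list, since read_tokens only needs an iterable of lines.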
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora