[Corpora-List] Resource release: Wikipedia corpora in Catalan, Spanish, and English

Gemma Boleda gboleda at lsi.upc.edu
Mon Nov 15 11:26:44 UTC 2010


Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia.

The Wikicorpus contains portions of the Catalan, Spanish, and English Wikipedias
based on a 2006 dump. The corpora have been automatically tagged with lemma and
part-of-speech information using the open-source library FreeLing. They have also
been annotated with WordNet senses using the state-of-the-art Word Sense
Disambiguation algorithm UKB. In their current version, the corpora have the
following sizes:

* Catalan: around 50 million words
* Spanish: around 120 million words
* English: around 600 million words

We provide access to the corpora in both raw-text and tagged versions, under the
same license as Wikipedia itself. To our knowledge, these are the largest Catalan
and Spanish corpora freely available for download. We also provide an open-source,
Java-based parser for Wikipedia pages, developed for the construction of the
corpus. For more information and downloads, please visit the project page:

http://www.lsi.upc.edu/~nlp/wikicorpus



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


