[Corpora-List] Resource release: Wikipedia corpora in Catalan, Spanish, and English

Lushan Han lushan1 at umbc.edu
Mon Dec 6 15:32:01 UTC 2010


Hi Gemma,

Thank you for releasing such a great corpus. However, the download link
http://www.lsi.upc.edu/~nlp/wikicorpus
is not working at the moment.
-- 
Best regards,
Lushan Han

PhD Student in Computer Science
University of Maryland, Baltimore County


On Mon, Nov 15, 2010 at 6:26 AM, Gemma Boleda <gboleda at lsi.upc.edu> wrote:

> Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia.
>
> The Wikicorpus contains portions of the Catalan, Spanish, and English
> Wikipedias based on a 2006 dump. The corpora have been automatically tagged
> with lemma and part-of-speech information using the open source library
> FreeLing. They have also been annotated with WordNet senses using the
> state-of-the-art Word Sense Disambiguation algorithm UKB. In their current
> version, the corpora have the following sizes:
>
> * Catalan: around 50 million words
> * Spanish: around 120 million words
> * English: around 600 million words
>
> We provide access to the corpora in their raw text and tagged versions,
> under the same license as Wikipedia itself. To our knowledge, these are the
> largest Catalan and Spanish corpora freely available for download. Moreover,
> we also provide an open source Java-based parser for Wikipedia pages
> developed for the construction of the corpus. For more information and
> download, please visit the project's page:
>
> http://www.lsi.upc.edu/~nlp/wikicorpus
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>