[Corpora-List] Has anybody processed Linguatools' Spanish Wikipedia corpus?

Gemma Boleda gemma.boleda at upf.edu
Thu Dec 18 22:28:04 UTC 2014


Dear colleagues,

I'd like to use the Spanish portion of the Wikipedia corpora that were
recently announced on this list (see below). Has anybody processed it with
a standard NLP pipeline (tokenization, lemmatization, and POS tagging would
be enough for my purposes) and is willing to share the processed version?
It would save me quite some time.
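For concreteness, the kind of processing I mean would look roughly like
the sketch below, here using spaCy's Spanish model (one option among
several; FreeLing is another common choice for Spanish). The file names
are placeholders, and I am assuming the corpus has already been reduced
to plain text, one paragraph per line:

import spacy

nlp = spacy.load("es_core_news_sm")  # tokenizer, POS tagger, lemmatizer

with open("eswiki_paragraphs.txt", encoding="utf-8") as src, \
     open("eswiki_processed.tsv", "w", encoding="utf-8") as out:
    paragraphs = (line.strip() for line in src)
    for doc in nlp.pipe(paragraphs, batch_size=100):
        for token in doc:
            # one token per line: surface form, lemma, coarse POS tag
            out.write(f"{token.text}\t{token.lemma_}\t{token.pos_}\n")
        out.write("\n")  # blank line separates paragraphs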

Thank you,

Gemma Boleda.



> 1. Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23
> languages extracted from Wikipedia. The corpora are annotated with
> article and paragraph boundaries, number of incoming links for each
> article, anchor texts used to refer to each article (textlinks) and their
> frequencies, cross-language links, categories and more
> (http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).
> There is also a script that allows you to extract domain-specific
> sub-corpora if you provide a list of desired categories.
>


-- 
Gemma Boleda
Universitat Pompeu Fabra
http://gboleda.utcompling.com