<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>Dear colleagues, </div><div><br></div><div>I'd like to use the Spanish portion of the Wikipedia corpora that were recently announced on this list (see below). Has anybody processed it with a standard NLP pipeline (tokenization, lemmatization, POS tagging would be enough for my purposes) and is willing to share the processed version? It'd save me quite some time. </div><div><br></div><div>Thank you, </div><div><br></div><div>Gemma Boleda. </div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div dir="ltr"><div><span style="font-family:arial,sans-serif;font-size:13px">1. Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">languages extracted from the Wikipedia. The corpora are annotated with</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">article and paragraph boundaries, number of incoming links for each</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">article, anchor texts used to refer to each article (textlinks) and their</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">frequencies, crosslanguage links, categories and more (</span><br style="font-family:arial,sans-serif;font-size:13px"><a href="http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/" style="font-family:arial,sans-serif;font-size:13px" target="_blank">http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/</a><span style="font-family:arial,sans-serif;font-size:13px">). 
There</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">is also a script that allows you to extract domain-specific sub-corpora if you</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">provide a list of desired categories.</span></div></div></div></div></div></blockquote><div><br></div></div><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr">Gemma Boleda<br>Universitat Pompeu Fabra<div><a href="http://gboleda.utcompling.com" target="_blank">http://gboleda.utcompling.com</a><br><br><br></div></div></div>
</div></div>