21.5076, FYI: Free Access: Corpora in Catalan, Spanish, English

linguist at LINGUISTLIST.ORG
Wed Dec 15 19:27:34 UTC 2010


LINGUIST List: Vol-21-5076. Wed Dec 15 2010. ISSN: 1068 - 4875.

Subject: 21.5076, FYI: Free Access: Corpora in Catalan, Spanish, English

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Monica Macaulay, U of Wisconsin-Madison
         Eric Raimy, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University
and by donations from subscribers and publishers.

Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.

===========================Directory==============================  

1)
Date: 15-Dec-2010
From: Gemma Boleda [gemma.boleda at gmail.com]
Subject: Free Access: Corpora in Catalan, Spanish, English
 

	
-------------------------Message 1 ---------------------------------- 
Date: Wed, 15 Dec 2010 14:26:24
From: Gemma Boleda [gemma.boleda at gmail.com]
Subject: Free Access: Corpora in Catalan, Spanish, English

Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia.

The Wikicorpus contains portions of the Catalan, Spanish, and English
Wikipedias, based on a 2006 dump. The corpora have been automatically tagged
with lemma and part-of-speech information using the open-source library
FreeLing. They have also been annotated with WordNet senses using the
state-of-the-art Word Sense Disambiguation algorithm UKB. In their current
version, the corpora have the following sizes:

* Catalan: around 50 million words
* Spanish: around 120 million words
* English: around 600 million words

We provide access to the corpora in both raw-text and tagged versions,
under the same license as Wikipedia itself. To our knowledge, these are the
largest Catalan and Spanish corpora freely available for download. For more
information and to download the corpora, please visit the project page:

http://www.lsi.upc.edu/~nlp/wikicorpus 



Linguistic Field(s): Text/Corpus Linguistics

Subject Language(s): Catalan-Valencian-Balear (cat)
                     English (eng)
                     Spanish (spa)


-----------------------------------------------------------
LINGUIST List: Vol-21-5076	
----------------------------------------------------------