[Corpora-List] Spanish corpus

Mark Davies Mark_Davies at byu.edu
Sat Oct 15 13:31:50 UTC 2005


>> I'm looking for a large morphologically annotated corpus of Spanish. 

>> Could anyone point me to any available resources?

You might try: http://www.corpusdelespanol.org/registers/

This is a web-based interface for a tagged 20-million-word corpus of Spanish (nearly all 1970s-1990s). The corpus is divided into three equally-sized sections -- spoken, fiction, and non-fiction. Besides basic part-of-speech tagging, the corpus was also tagged for nearly 150 different syntactic features (nominalizations, passives, clefts, etc.).

Via the web-based interface, you can find the frequency of each of the ~150 features in 19 different registers (formal conversation, fiction, business letters, etc.). Conversely, you can select any two of the 19 registers (e.g. sports broadcasts and encyclopedias) and find which syntactic features differ most in frequency between them. In all cases, you can select the feature and register and then see a KWIC display of hits from the corpus.
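
To make the register comparison concrete, here is a rough Python sketch of the idea. It is not how the site computes its rankings (the measure it uses is not described here); it simply assumes each feature has a normalized rate per register and ranks features by the absolute difference between two registers. The function name, the data layout, and all of the numbers are made up for illustration.

```python
def rank_feature_differences(freqs, register_a, register_b):
    """Rank features by absolute difference in rate between two registers.

    freqs maps feature name -> {register name -> rate}, where the rate is
    assumed to be normalized (e.g. occurrences per 1,000 words).
    Returns (feature, rate_a - rate_b) pairs, largest absolute gap first.
    """
    diffs = []
    for feature, by_register in freqs.items():
        rate_a = by_register[register_a]
        rate_b = by_register[register_b]
        diffs.append((feature, rate_a - rate_b))
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)


# Invented rates per 1,000 words; NOT taken from the corpus.
EXAMPLE_FREQS = {
    "nominalizations": {"sports broadcasts": 4.0, "encyclopedias": 19.0},
    "passives": {"sports broadcasts": 1.0, "encyclopedias": 6.0},
    "clefts": {"sports broadcasts": 2.0, "encyclopedias": 1.5},
}

ranking = rank_feature_differences(
    EXAMPLE_FREQS, "sports broadcasts", "encyclopedias"
)
for feature, diff in ranking:
    print(f"{feature}: {diff:+.1f}")
```

A negative value means the feature is more frequent in the second register. A raw difference is only one possible measure; a ratio or a significance statistic would be equally reasonable choices.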

In terms of background, the corpus comes largely from the 20 million words from the 1900s in the 100-million-word Corpus del Espanol (www.corpusdelespanol.org). The tagging was done as part of a project to look at "multidimensional analysis of register variation in Spanish", which was funded by the US National Science Foundation and carried out by Doug Biber (NAU) and Mark Davies (BYU).

=================================================
Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
================================================= 

 


