[Corpora-List] Corpus del Español Actual (CEA) / The Corpus of Contemporary Spanish

Carlos Subirats carlos.subirats at gmail.com
Thu Apr 26 06:44:27 UTC 2012


*Corpus del Español Actual (CEA)* <http://sfn.uab.es:9080/SFN/tools/cea/spanish> /
*The Corpus of Contemporary Spanish* <http://sfncorpora.uab.es/CQPweb/cea/>
(Powered by CQPweb)

The *Corpus del Español Actual* <http://sfncorpora.uab.es/CQPweb/cea/> (the
Corpus of Contemporary Spanish) contains *540 million words*, which have
been lemmatized and tagged with detailed part-of-speech information. The
CEA is made up of the following texts:

   - The Spanish part of the eleven-language parallel corpus Europarl:
   European Parliament Proceedings Parallel Corpus, v. 6
   <http://www.statmt.org/europarl/> (1996-2010);
   - The Spanish portion of the trilingual Wikicorpus, v. 1.0
   <http://www.lsi.upc.edu/%7Enlp/wikicorpus/>,
   which was extracted from a snapshot of Wikipedia (2006); and
   - The Spanish part of the seven-language parallel corpus MultiUN:
   Multilingual UN Parallel Text 2000-2009
   <http://www.euromatrixplus.net/multi-un/>,
   a corpus made up of the resolutions of the United Nations.

The CEA was tagged using an online Spanish dictionary
<http://sfn.uab.es:9080/SFN/tools/dictionary> containing
635,000 wordforms, which was automatically generated from a
dictionary of 86,000 single-word lemmas (e.g., *unir*, *inmoralidad*, *allí*)
and 26,000 multiword lemmas (e.g., *muerte cerebral*, *carga de profundidad*,
*de armas tomar*) (Subirats 1989, 1992, 1994a, 1994b; Mogorrón 1994;
Garrido 1999; Bobes 2000). Tag disambiguation was carried out with
intersecting finite-state automata using lexical and syntactic information
(Subirats 1998; Subirats and Ortega 2000, 2001; Ortega in progress).
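
As a rough illustration of the general idea of disambiguating by automaton intersection, here is a minimal Python sketch. It is not the CEA's actual method or tagset: the two-word lexicon, the tag names (DET, PRON, NOUN, VERB), the helper names and the two constraints are all hypothetical, chosen only to show how the candidate tag sequences of a sentence can be intersected with small constraint automata so that only compatible readings survive.

```python
# Hypothetical toy lexicon: wordform -> candidate tags (NOT the CEA tagset).
LEXICON = {
    "la": {"DET", "PRON"},
    "casa": {"NOUN", "VERB"},
}


def forbid_bigram(first, second):
    """Return the transition function of a small DFA that rejects any
    tag sequence in which `first` is immediately followed by `second`."""
    def step(state, tag):
        if state == "dead" or (state == "seen" and tag == second):
            return "dead"
        return "seen" if tag == first else "start"
    return step


def disambiguate(words, constraints):
    """Intersect the sentence's tag lattice with all constraint DFAs.

    The product automaton is explored on the fly: each frontier entry
    pairs a partial tag path with one state per constraint DFA.
    """
    frontier = [((), tuple("start" for _ in constraints))]
    for word in words:
        next_frontier = []
        for path, states in frontier:
            for tag in sorted(LEXICON[word]):
                new_states = tuple(
                    step(state, tag) for step, state in zip(constraints, states)
                )
                # A path survives only if no constraint DFA has died.
                if "dead" not in new_states:
                    next_frontier.append((path + (tag,), new_states))
        frontier = next_frontier
    return [path for path, _ in frontier]


# Assumed constraints, for illustration only.
RULES = [forbid_bigram("DET", "VERB"), forbid_bigram("PRON", "NOUN")]
readings = disambiguate(["la", "casa"], RULES)
print(readings)
```

With these assumed rules, "la casa" keeps two readings (determiner + noun, pronoun + verb), which reflects a genuine ambiguity that further lexical or syntactic constraints would have to resolve.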

*Searching the CEA:*

The query interface for the CEA is CQPweb
<http://cwb.sourceforge.net/cqpweb.php>,
which uses some of the components of the IMS Open Corpus Workbench (CWB)
<http://cwb.sourceforge.net/>,
a set of open-source tools for managing and searching large corpora,
including the Corpus Query Processor (CQP). To learn more about how to use
CQPweb, you can consult the IMS's brief description of the regular-expression
syntax
<http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPSyntax.html>
used by CQP and their list of sample queries
<http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPExamples.html>.
If you wish to define your query in terms of grammatical and inflectional
categories, you can use the part-of-speech tags listed on the CEA's Corpus
Tags <http://sfn.uab.es:9080/SFN/tools/cea/corpus-tags> page.
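
To give a feel for what CQP queries look like, here is a small Python sketch that assembles query strings in CQP's token syntax, where each token is a bracketed set of attribute="regex" conditions and a sequence is written by juxtaposition. The helper names `token` and `sequence` are hypothetical, and the attribute names `word` and `lemma` are assumptions: the attributes actually available in the CEA, and its tag values, should be checked against the Corpus Tags page. Values are not escaped, so this only suits simple patterns.

```python
def token(**attrs):
    """Build one CQP token expression, e.g. [lemma="unir"].

    With no conditions it returns [], which matches any token.
    """
    if not attrs:
        return "[]"
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def sequence(*tokens):
    """Join token expressions into a CQP sequence query."""
    return " ".join(tokens)


# All forms of the verb "unir" (assumes a `lemma` attribute exists).
q1 = token(lemma="unir")

# The multiword expression "muerte cerebral", matching any inflected
# form of the head noun (again assuming `lemma` and `word` attributes).
q2 = sequence(token(lemma="muerte"), token(word="cerebral"))

print(q1)
print(q2)
```

The resulting strings can be pasted into the CQPweb query box in CQP-syntax mode.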