[Corpora-List] Answers to domain corpora request

Carlos Rodriguez crodriguezp at gmail.com
Fri Apr 1 16:12:19 UTC 2005


Thanks to everyone who answered my request for open-source domain corpora.
Leonel Ruiz and Stella Tagnin pointed me to corpora in Spanish and
Brazilian Portuguese. For English, Ylva Berglund mentioned OPUS (an open
source parallel corpus). On the text mining front, large textual
collections of Bio-Medical full-text articles are now available, as
pointed out by Paul Buitelaar (http://muchmore.dfki.de/resources1.htm)
and Kevin Cohen (http://www.biomedcentral.com/info/about/datamining/
[8,000-plus articles in XML]), among other data collections. Also, the
Linux Documentation Project provides a fairly large, typologically
homogeneous collection.
Unfortunately, large textual collections from other disciplines are more
difficult to obtain in downloadable form. I am now compiling a 300-article
collection from Sociology journals, in case anyone is also
interested in cross-genre comparisons and lexical acquisition.

Carlos Rodríguez
National Autonomous University, Mexico


