[Corpora-List] Summary - free online corpora

Thu May 31 14:36:20 UTC 2007

Many thanks to everyone who responded to my recent query about free online corpora. Here is a summary of the responses I have received: 

Jenny in Hong Kong directed me to the Hong Kong Polytechnic University's Virtual Language Centre http://vlc.polyu.edu.hk/, which takes you to a concordancer with different corpora.

Lene Petersen highlighted the KEMPE Korpus of Early Modern Playtexts in English which is available to search free of charge via http://corp.hum.sdu.dk/cqp.en.html. "The VISL site also hosts wikipedia and chat corpora that are password free."

Jörg Tiedemann pointed me to the OPUS collection of parallel corpora (including English). There is an on-line search interface at http://logos.uio.no/cgi-bin/opus/opuscqp.pl, and another (hidden) search interface for Europarl with some more features: http://logos.uio.no/opus/EUROPARL/frames-cqp.html

Elzbieta Dura mentioned http://bergelmir.iki.his.se/culler/ where there are a number of corpora in biomedicine and also an English-Swedish JRC-Acquis parallel corpus. At http://www.nla.se.culler there is a corpus of older English. She also noted that comments on the corpus tool Culler are welcome. 

Michaela Geierhos said: "Perhaps you are already aware of Mark Davies's TIME corpus. He provides a web interface to do basic KWIC, collocates, n-gram searches, etc. TIME corpus (new May 2007; 100m words; US 1900s) http://view.byu.edu/timemag. Another quite useful thing is GlossaNet. It's a search engine that gives you daily access to the online editions of more than 100 newspapers in 12 languages. http://glossa.fltr.ucl.ac.be/. It requires registration for intensive use because it's possible to get the concordances of all chosen newspapers daily or weekly etc. by e-mail. You can also take a look at the system before registering: http://glossa.fltr.ucl.ac.be/scripts/gtoday/gtoday.pl. There you'll see an overview of all accessible newspapers by language."

Eckhard Bick highlighted the English section of Corpus Eye (at http://corp.hum.sdu.dk), which contains a number of further online corpora (all morphologically and syntactically annotated and searchable), of which the following are password-free: Europarl corpus (25.7 mill. words); Wikipedia corpus (115 mill. words); Chat corpus (23.5 mill. words); KEMPE Shakespeare corpus (8.9 mill. words); Enron e-mail corpus (75 mill. words).

Ana Frankenberg directed me to the COMPARA corpus, a 3 million-word bidirectional parallel corpus of English and Portuguese. "People can use just the English (or just the Portuguese) side of the corpus if they wish. The corpus is online, free and requires no registration. See http://www.linguateca.pt/COMPARA/Welcome.html"

Elisa Duarte Teixeira and Stella Tagnin told me that "the English part of the CorTec corpus, a Portuguese-English technical comparable corpus, which is part of the COMET Project (Multilingual Corpora for Teaching and Translation), can be freely searched at this address: (http://www.fflch.usp.br/dlm/comet/consulta_cortec.html). Although the English version of the site is not finished, there you'll find the documentation that explains the composition of the 5 corpora in English. Soon, all the 5 corpora will receive more texts and new areas will be added - we'll announce it here, when it's ready." Stella Tagnin also pointed out a monolingual Brazilian Portuguese Corpus - Lácio-Web, at www.nilc.icmc.usp.br/lacioweb.

Huaqing Hong suggested the SCoRE corpus at: http://score.crpp.nie.edu.sg/. You can register online to try the demo version. 

Ilya at the Linguistic Data Consortium directed me to: https://online.ldc.upenn.edu/login.html to sign up for a guest account to LDC Online. "With a guest account, you can search a subset of English newstext the LDC has acquired, as well as search and listen to English telephone conversations.  The American English Spoken Lexicon is also included."

Stefan Bordag suggested I look at corpora.uni-leipzig.de, which contains an English corpus as well as others and is freely accessible online, as well as downloadable. 

Ralf Steinberger highlighted the 55-million-word English part of the multilingual parallel corpus JRC-Acquis. "The overall corpus, including all 22 languages, consists of over 1 billion words. You cannot search the corpus via a web interface, but you can simply download the JRC-Acquis documents from the site http://langtech.jrc.it/JRC-Acquis.html."

For completeness, here are the corpora I included in my first message: 

BNC (http://www.natcorp.ox.ac.uk/)
VIEW interface to the BNC (http://view.byu.edu/)
COBUILD Corpus Concordance Sampler (http://www.collins.co.uk/corpus/CorpusSearch.aspx)
SCOTS (http://www.scottishcorpus.ac.uk)
ELISA (http://www.uni-tuebingen.de/elisa/html/elisa_index.html)
Compleat Lexical Tutor (access to Brown and BNC sampler among others) (http://www.lextutor.ca/)
Virtual Language Centre Web Concordancer (access to Brown, LOB among others) (http://www.edict.com.hk/default.htm)
IViE Corpus (http://www.phon.ox.ac.uk/IViE/)
Speech Accent Archive (http://accent.gmu.edu/)

thanks again!

Wendy
....................
Dr Wendy J Anderson
Scottish Corpus of Texts and Speech
Department of English Language
University of Glasgow
12 University Gardens
Glasgow
G12 8QQ
Scotland, UK

Website: http://www.scottishcorpus.ac.uk