[Corpora-List] the most user-friendly online corpora (monolingual and parallel)

Mark Davies Mark_Davies at byu.edu
Wed Nov 11 01:58:39 UTC 2009


>> ... the BYU corpora. As one of those 43,000 users Mark is talking about, I can attest to its user-friendliness and usefulness

Thanks for the kind words, Linda. 

I'm sure that you've already heard from people associated with some of the following online corpus interfaces, but in case you haven't, you might also consider the English corpora from:

Sketch Engine: http://www.sketchengine.co.uk/
VISL: http://visl.sdu.dk/
PIE: http://pie.usna.edu/
BNCweb: http://bncweb.info/

>> The only thing you need to be a bit careful of with COCA is when comparing a lexical item's frequency by genre,  the "spoken" mainly comes from the news 

It is true that the spoken texts come from *unscripted* conversation on TV and radio programs (on a really wide range of topics -- cooking demonstrations, celebrity interviews, politics, personal finance, relationship issues, etc.). But how else could one create an 83 million word corpus of spoken English with a staff of one person and a budget of $0? Nevertheless, these spoken texts still do quite a nice job of modeling contemporary spoken English. Some examples of constructions in COCA that are much more common in the spoken texts:

and I'm like , : http://www.americancorpus.org/?q=2299077
so not ADJ : http://www.americancorpus.org/x1.asp?q=2299079
I guess that : http://www.americancorpus.org/x1.asp?q=2299083
. Well , : http://www.americancorpus.org/x1.asp?q=2299085
. Sure . : http://www.americancorpus.org/x1.asp?q=2299086
, you know , : http://www.americancorpus.org/x1.asp?q=2299088

*Of course* these spoken transcripts don't model conversational spoken English as well as the BNC or the spoken corpora from the LDC, which were created by large teams of researchers with lots and lots of money. On the other hand, they are much larger in size -- eight times as large as the spoken part of the BNC and 30-40 times as large as the individual LDC corpora. They're also more recent. So it's a tradeoff. Anyway, for more information on the spoken texts in COCA, see [ More information / Spoken transcripts ] at the corpus website (www.americancorpus.org).
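
Because the spoken sections differ so much in size, raw hit counts are not directly comparable between them; a per-million-words rate is the usual way to put them on the same footing. The following is only a minimal sketch of that arithmetic. The 83 million figure is the one given above, the BNC spoken size is merely what "eight times as large" implies (an approximation, not an official figure), and the hit counts and the function name per_million are invented for illustration.

```python
# Sketch: normalise raw hit counts to a per-million-words rate so that
# corpus sections of very different sizes can be compared.

def per_million(hits: int, section_words: int) -> float:
    """Return the number of occurrences per million words."""
    if section_words <= 0:
        raise ValueError("section_words must be positive")
    return hits * 1_000_000 / section_words

COCA_SPOKEN_WORDS = 83_000_000  # size stated in the message above
# Implied by "eight times as large"; an approximation, not an official figure.
BNC_SPOKEN_WORDS = COCA_SPOKEN_WORDS // 8

# Hypothetical raw counts, invented purely for illustration.
hits_coca = 8_300
hits_bnc = 2_075

rate_coca = per_million(hits_coca, COCA_SPOKEN_WORDS)
rate_bnc = per_million(hits_bnc, BNC_SPOKEN_WORDS)

print(f"COCA spoken: {hits_coca} hits, {rate_coca:.1f} per million words")
print(f"BNC spoken:  {hits_bnc} hits, {rate_bnc:.1f} per million words")
```

With these made-up counts the larger section has four times as many raw hits but only half the normalised rate, which is why the rate rather than the raw count is the figure to compare.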

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu
 
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


From: Linda Bawcom [linda.bawcom at sbcglobal.net]
Sent: Tuesday, November 10, 2009 5:30 PM
To: Mark Davies; Xiaotian Guo; Corpora list
Subject: Re: [Corpora-List] the most user-friendly online corpora (monolingual and parallel)


Dear Xiaotian Guo,

Mark has been very modest regarding the BYU corpora. As one of those 43,000 users Mark is talking about, I can attest to its user-friendliness and usefulness. I have referred to it (COCA) frequently for my doctoral thesis and when teaching. (I even have it as a shortcut on my screen for quick access in class.) The only thing you need to be a bit careful of with COCA is when comparing a lexical item's frequency by genre: the "spoken" mainly comes from the news (so Mark's BNC interface is perhaps a bit more reliable there).

Kindest regards,
Linda





From: Mark Davies <Mark_Davies at byu.edu>
To: Xiaotian Guo <garlickfred at gmail.com>; Corpora list <corpora at uib.no>
Sent: Tue, November 10, 2009 7:55:40 AM
Subject: Re: [Corpora-List] the most user-friendly online corpora (monolingual and parallel)

Dear Xiaotian Guo,

You might look at the corpora from:

http://corpus.byu.edu

(English corpora = BYU-BNC, Corpus of Contemporary American English (COCA), TIME Corpus, etc).

You asked about who is using what. In terms of usage of these three corpora, in the past month there were:

43,000 distinct users for the Corpus of Contemporary American English
21,000 for BYU-BNC
4,000 for TIME

I believe that BYU-BNC is the most widely used online interface for the BNC, but I can't prove this; I don't have full usage data from the creators of other BNC interfaces. COCA and the BNC were being used at about the same rate until 8-9 months ago, but COCA is now used about twice as much as the BNC.

>> corpora available online such as the Bank of English, the British National Corpus, the American National Corpus

I don't believe that the ANC is available via an online interface, and the full Bank of English online costs about $1150 per year (via Wordbanks Online: http://www.collinslanguage.com/wordbanks/subscribe/default.aspx).

I hope this helps.

Mark Davies


From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Xiaotian Guo
Sent: Tuesday, November 10, 2009 6:38 AM
To: Corpora list
Subject: [Corpora-List] the most user-friendly online corpora (monolingual and parallel)

Dear corpora Colleagues

I am teaching a group of students on the SOAS Translation Technology course, which involves the use of corpora as a computer-aided translation tool for human translators. I am aware that there are many monolingual and parallel corpora available online, such as the Bank of English, the British National Corpus, the American National Corpus and a number of parallel corpora for various language pairs. But I feel different people may have strong preferences for particular corpora for their own use because of the corpora's user-friendliness (or perhaps availability). Could anyone be kind enough to let me know your favourites and perhaps why?

My students are native speakers of Chinese, Japanese, Korean, Arabic and Persian.

If anyone is teaching a similar course which involves the use of corpora for human translation, please feel free to share your experiences with me. I would be very grateful.

All the best

Xiaotian Guo
SOAS
New Vision Language Centre



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora