[Corpora-List] British / American corpora

Mark Davies Mark_Davies at byu.edu
Thu Nov 9 14:23:19 UTC 2006


>> I find it fascinating that as soon as we find a linguistic topic which
sparks the interest of everyone here, the discussion suddenly makes
hardly any reference to corpora. Why are suddenly anecdotes, intuitions,
folk theories and made-up examples preferable to consulting corpora?

>> It's a serious question. It seems to me reasonable to bring in these
other factors and pieces of evidence to inform a discussion about corpus
linguistics, but why is almost no-one consulting a corpus, or consulting
research papers based on corpora? Lack of resources? Lack of tools?
Don't think that use of corpora is appropriate for this question?

I think that it is mainly due to a lack of large, comparable corpora of British and American English. While we have Brown/LOB and FROWN/FLOB, etc., these are much too small for many types of studies. And in spite of work on the ANC, my feeling is that (at 22m words) it is still not at the point where, for many types of investigations, it would allow useful comparisons with the BNC.
 
My sense is that we need more large corpora that are explicitly designed to allow comparison of British and American English (in addition to other varieties). Along these lines, I've recently applied for a grant from the US National Endowment for the Humanities to create a 200 million word corpus of English covering the 1500s-1900s (30m words for each century from the 1500s through the 1700s, 50m for the 1800s, and 60m for the 1900s). The 1900s portion will have 6 million words from each decade, and will be balanced for spoken, fiction, news, and academic. The architecture and interface will be similar to that of my VIEW interface to the BNC (http://view.byu.edu) and the recently completed Corpus do Portugues (http://www.corpusdoportugues.org). It will be tagged for part of speech and lemma with help from the creators of the tagger for the BNC.
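
As a quick sanity check on those figures, here is a minimal Python sketch that adds up the planned sizes. The variable names are only illustrative, and the even split across the four genres is an assumption on top of the word "balanced", not something fixed in the proposal.

```python
# Planned sizes in millions of words, as given in this message.
words_by_century = {
    "1500s": 30,
    "1600s": 30,
    "1700s": 30,
    "1800s": 50,
    "1900s": 60,
}
total_millions = sum(words_by_century.values())

# The 1900s portion: 6 million words for each of ten decades.
words_per_decade_1900s = 6
decades_in_1900s = 10
total_1900s = words_per_decade_1900s * decades_in_1900s

# ASSUMPTION: "balanced" is read here as an equal share per genre.
genres = ["spoken", "fiction", "news", "academic"]
per_genre_per_decade = words_per_decade_1900s / len(genres)

print(f"Total corpus size: {total_millions}m words")
print(f"1900s portion from decades: {total_1900s}m words")
print(f"Per genre, per decade (if split evenly): {per_genre_per_decade}m words")
```

Under that reading, each genre would contribute roughly 1.5 million words per decade in the 1900s portion.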
 
In terms of British and American English, for the late 1700s through the 1900s it will be about 45% British and 45% American English, with about 10% from other dialects. If funded, then, this corpus should allow for some nice comparisons between these different varieties of English.
 
Best,
 
Mark Davies
 
=================================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
================================================= 

 


