[Corpora-List] New corpus: GloWbE -- 1.9 billion words, 20 countries

Mark Davies Mark_Davies at byu.edu
Mon Apr 15 20:52:40 UTC 2013


We have just released a new corpus at corpus.byu.edu<http://corpus.byu.edu/>, which may be of interest to some of you:



GloWbE: Corpus of Global Web-Based English<http://corpus2.byu.edu/glowbe/>



This new corpus is 1.9 billion words in size and is based on 1.8 million web pages (including blogs) from 20 different English-speaking countries (US, UK, NZ, India, Hong Kong, etc.). GloWbE is 4-5 times as large as COCA and about 20 times as large as the BNC, and thus yields much richer data for some low-frequency constructions.



The real power of GloWbE, though, is the ability to see the frequency of any word, phrase, or grammatical construction in each of the 20 countries. You can also compare any feature across two sets of dialects, such as British and American English (more than 775 million words of text for these two dialects alone). Or you can limit your search to one or two countries (e.g. Australia (148 million words), South Africa (45 million), or Singapore (43 million)), and you'll still be searching the largest online corpus for most of these 20 countries.
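
Because the country sections differ so much in size, raw hit counts are not directly comparable across them; a common approach is to normalize each count to a rate per million words. The following is only a minimal sketch of that arithmetic, not the corpus interface itself. The section sizes are the ones quoted above, while the hit counts and the names `per_million` and `hypothetical_hits` are made up for illustration.

```python
# Section sizes in words, as quoted in the announcement above.
CORPUS_SIZES = {
    "Australia": 148_000_000,
    "South Africa": 45_000_000,
    "Singapore": 43_000_000,
}


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    return raw_count / corpus_size * 1_000_000


# Hypothetical raw counts for some search string; NOT real GloWbE results.
hypothetical_hits = {"Australia": 740, "South Africa": 450, "Singapore": 86}

# Rate per million words for each country section.
normalized = {
    country: per_million(count, CORPUS_SIZES[country])
    for country, count in hypothetical_hits.items()
}

# Print the sections from highest to lowest normalized frequency.
for country, rate in sorted(normalized.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country:<14}{rate:6.2f} per million")
```

With these invented counts, Australia has the most raw hits but South Africa has the highest rate per million words, which is the kind of difference that raw counts alone would hide.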



This new corpus of World English adds nicely to the other corpora from corpus.byu.edu, which allow you to examine variation<http://corpus.byu.edu/variation.asp> in English in ways that are perhaps not possible with other corpora:



-- historical: COHA, TIME, COCA (recent change), Google Books (Advanced)

-- genres: COCA and BYU-BNC

-- dialects: GloWbE, and side-by-side comparisons of corpora






============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================




