24.1694, FYI: New Corpus: GloWbE 1.9 Billion Words, 20 Countries
linguist at linguistlist.org
Tue Apr 16 15:05:37 UTC 2013
LINGUIST List: Vol-24-1694. Tue Apr 16 2013. ISSN: 1069 - 4875.
Subject: 24.1694, FYI: New Corpus: GloWbE 1.9 Billion Words, 20 Countries
Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews: Veronika Drake, U of Wisconsin Madison
Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
<reviews at linguistlist.org>
Homepage: http://linguistlist.org
Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================
Date: Tue, 16 Apr 2013 11:05:33
From: Mark Davies [mark_davies at byu.edu]
Subject: New Corpus: GloWbE 1.9 Billion Words, 20 Countries
We have just released a new corpus at corpus.byu.edu, which may be of interest
to some of you:
GloWbE: Corpus of Global Web-Based English
http://corpus2.byu.edu/glowbe/
This new corpus is 1.9 billion words in size, and is based on 1.8 million web
pages (including blogs) from 20 different English-speaking countries (US, UK,
NZ, India, Hong Kong, etc.). GloWbE is 4-5 times as large as COCA and about 20
times as large as the BNC, so it yields much richer data for many
low-frequency constructions.
The real power of GloWbE, though, is the ability to see the frequency of any
word, phrase, or grammatical construction in each of the 20 different
countries. You can also compare any feature across two sets of dialects, such
as British and American English (more than 775 million words of text for just
these two dialects). Or you can limit your search to one or two countries
(e.g. Australia (148 million words), South Africa (45 million), or Singapore
(43 million)), and you'll still be searching the largest online corpus for
most of these twenty countries.
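Because the country sections differ so much in size, raw hit counts are not
directly comparable from one country to the next. A minimal sketch of the
usual per-million-words normalization follows, in Python. The section sizes
are the ones quoted above; the hit counts and the function name per_million
are invented for illustration and are not GloWbE output or part of its
interface.

```python
# Section sizes in words, as quoted in the announcement.
CORPUS_SIZE_WORDS = {
    "Australia": 148_000_000,
    "South Africa": 45_000_000,
    "Singapore": 43_000_000,
}


def per_million(raw_hits: int, corpus_words: int) -> float:
    """Return the frequency of an item per million words of corpus text."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return raw_hits * 1_000_000 / corpus_words


# Hypothetical raw hit counts for one search string (NOT real GloWbE data).
hypothetical_hits = {
    "Australia": 2960,
    "South Africa": 1350,
    "Singapore": 1290,
}

# Normalize each country's count by the size of its own section.
normalized = {
    country: per_million(hits, CORPUS_SIZE_WORDS[country])
    for country, hits in hypothetical_hits.items()
}

for country, rate in normalized.items():
    print(f"{country:<13} {rate:6.1f} per million words")
```

With these invented counts, Australia has the most raw hits but the lowest
normalized rate, which is the kind of reversal that normalization is meant to
expose.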
This new corpus of World English adds nicely to the other corpora from
corpus.byu.edu, which allow you to examine variation in English in ways that
are perhaps not possible with other corpora (see
http://corpus.byu.edu/variation.asp):
-- historical: COHA, TIME, COCA (recent change), Google Books (Advanced)
-- genres: COCA and BYU-BNC
-- dialects: GloWbE, and side-by-side comparisons of corpora
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
Corpus design and use // Linguistic databases
Historical linguistics // Language variation
English, Spanish, and Portuguese
Linguistic Field(s): Computational Linguistics; Lexicography;
                     Text/Corpus Linguistics
Subject Language(s): English (eng)