25.1708, FYI: Full-Text Corpora: Contemporary American English (COCA) and Global Web-Based English (GloWbE)
The LINGUIST List
linguist at linguistlist.org
Fri Apr 11 17:15:45 UTC 2014
LINGUIST List: Vol-25-1708. Fri Apr 11 2014. ISSN: 1069 - 4875.
Subject: 25.1708, FYI: Full-Text Corpora: Contemporary American English (COCA) and Global Web-Based English (GloWbE)
Moderators: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>
Reviews: Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Mateja Schuck, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
<reviews at linguistlist.org>
Homepage: http://linguistlist.org
Editor for this issue: Uliana Kazagasheva <uliana at linguistlist.org>
================================================================
Date: Fri, 11 Apr 2014 13:15:22
From: Mark Davies [mark_davies at byu.edu]
Subject: Full-Text Corpora: Contemporary American English (COCA) and Global Web-Based English (GloWbE)
At http://corpus.byu.edu/full-text/ you can now download full-text data for the following two corpora:
Corpus of Contemporary American English (COCA).
440 million words of downloadable text (190,000 separate texts). Balanced for genre: about 88 million words each of spoken, fiction, magazine, newspaper, and academic text. With the included [sources] table, you can also search by sub-genre, e.g. News-Financial or Academic-Medicine.
Corpus of Global Web-Based English (GloWbE).
1.8 billion words of downloadable text (1,800,000 separate texts). Divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the text comes from blogs, which provide very informal language.
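As a rough illustration of the sub-genre search mentioned for COCA, the sketch below joins a sources table to a token table in an in-memory SQLite database. The table names, column names (text_id, genre, subgenre, word), and rows are all hypothetical, invented for this example; the schema of the downloadable data may well differ.

```python
import sqlite3

# In-memory database with a HYPOTHETICAL schema; the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sources (text_id INTEGER PRIMARY KEY, genre TEXT, subgenre TEXT)"
)
conn.execute("CREATE TABLE tokens (text_id INTEGER, word TEXT)")

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO sources VALUES (?, ?, ?)",
    [(1, "NEWS", "News-Financial"), (2, "ACAD", "Academic-Medicine")],
)
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?)",
    [(1, "market"), (1, "market"), (1, "bank"), (2, "patient")],
)


def word_counts(conn, subgenre):
    """Return (word, frequency) pairs for all texts in one sub-genre."""
    query = """
        SELECT t.word, COUNT(*) AS freq
        FROM tokens AS t
        JOIN sources AS s ON s.text_id = t.text_id
        WHERE s.subgenre = ?
        GROUP BY t.word
        ORDER BY freq DESC, t.word
    """
    return conn.execute(query, (subgenre,)).fetchall()


print(word_counts(conn, "News-Financial"))
```

Keeping the metadata in its own table means a genre or sub-genre filter is a single join on the text identifier rather than a scan of file names.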
With the full-text data from either corpus, you will have the actual corpus on your own computer. As a result, you can do many things that would be difficult or impossible with the standard web interface, such as complex and time-consuming syntactic and semantic searches, sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and so on. You can also generate word frequency lists (e.g. top 100,000 words, by (sub-)genre), collocates (millions of pairs), and n-grams (hundreds of millions of strings).
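To make the frequency-list and n-gram ideas concrete, here is a minimal sketch in Python using only the standard library. The tokenizer is a deliberately crude assumption (lowercased runs of letters and apostrophes) and the sample sentence is invented; real work on the corpus files would need a tokenizer matched to their actual layout.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, top=10):
    """Return the `top` most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(top)


def ngrams(tokens, n=2):
    """Count every run of n adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Invented sample text, not taken from either corpus.
sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)

print(frequency_list(tokens, top=2))
print(ngrams(tokens, 2).most_common(1))
```

The same two functions apply unchanged to a whole file read in as one string; for very large files, feeding the counter line by line would keep memory use down.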
The data comes in three different formats (see samples): data for relational databases (info), word/lemma/PoS (vertical), and linear text (horizontal). When you obtain the data, you have the rights to any and all of these formats.
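The relationship between the vertical and horizontal formats can be sketched as follows. This assumes, purely for illustration, that the vertical format has one token per line with three tab-separated columns (word, lemma, PoS); the actual column layout should be checked against the samples. The sample rows and tags are invented.

```python
from collections import Counter


def parse_vertical(lines):
    """Parse lines of 'word<TAB>lemma<TAB>PoS' into (word, lemma, pos) tuples.

    The three-column layout is an assumption; lines that do not have exactly
    three fields are skipped.
    """
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


# Invented sample in the assumed layout.
sample = [
    "The\tthe\tat\n",
    "cats\tcat\tnn2\n",
    "sleep\tsleep\tvv0\n",
    ".\t.\t.\n",
]

rows = parse_vertical(sample)

# Vertical to horizontal: join the word column back into linear text.
horizontal = " ".join(word for word, _, _ in rows)

# Lemma counts restricted to tags that start with "nn" (nouns, in this sample).
noun_lemmas = Counter(lemma for _, lemma, pos in rows if pos.startswith("nn"))

print(horizontal)
print(noun_lemmas.most_common())
```

Working from the vertical format keeps lemma and part-of-speech information available for filtering, while the horizontal text is easier to feed to tools that expect plain running text.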
Mark Davies
http://davies-linguistics.byu.edu/
Linguistic Field(s): Computational Linguistics
Lexicography
Text/Corpus Linguistics
Subject Language(s): English (eng)
----------------------------------------------------------