29.2151, FYI: The new 14 billion word iWeb corpus (BYU corpora)

The LINGUIST List linguist at listserv.linguistlist.org
Thu May 17 17:35:34 UTC 2018


LINGUIST List: Vol-29-2151. Thu May 17 2018. ISSN: 1069 - 4875.

Subject: 29.2151, FYI: The new 14 billion word iWeb corpus (BYU corpora)

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================


Date: Thu, 17 May 2018 13:35:18
From: Mark Davies [mark_davies at byu.edu]
Subject: The new 14 billion word iWeb corpus (BYU corpora)

 
We have just released the new 14 billion word iWeb corpus:

https://corpus.byu.edu/iweb/

iWeb complements other BYU corpora (https://corpus.byu.edu) such as COCA,
COHA, NOW, BYU-BNC, GloWbE, Wikipedia, and EEBO.
 
At 14 billion words, iWeb is about 25 times as large as the 560 million
word COCA. iWeb also has a much wider range of web-based materials than
does COCA, since it is based on 22 million web pages from nearly 100,000
carefully selected websites (chosen based on data from Alexa.com, from Amazon).
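
As a rough check on the sizes quoted above, the arithmetic can be sketched
in a few lines of Python. This is only an illustration: the constants are
the rounded figures from this announcement ("nearly 100,000" websites is
treated as exactly 100,000), and the variable names are hypothetical, not
part of any BYU corpus interface.

```python
# Rounded figures taken from the announcement; real counts will differ slightly.
IWEB_WORDS = 14_000_000_000   # "14 billion words"
COCA_WORDS = 560_000_000      # "560 million word COCA"
IWEB_PAGES = 22_000_000       # "22 million web pages"
IWEB_SITES = 100_000          # assumption: "nearly 100,000" rounded up

# How many times larger iWeb is than COCA, by word count.
size_ratio = IWEB_WORDS / COCA_WORDS

# Average page length in words, and average number of pages per website.
words_per_page = IWEB_WORDS / IWEB_PAGES
pages_per_site = IWEB_PAGES / IWEB_SITES

print(f"iWeb / COCA size ratio: {size_ratio:.1f}")
print(f"Average words per page: {words_per_page:.0f}")
print(f"Average pages per site: {pages_per_site:.0f}")
```

Because the site count is rounded up, the pages-per-site figure is a lower
bound on the average implied by the announcement.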
 
New in iWeb is the ability to browse through the top 60,000 words in the
corpus, and to search this list by word form, part of speech, rank
(#1-60,000), and even pronunciation.
 
Most importantly, you can then see detailed information on each of the top
60,000 words in the corpus – definition, frequency information, synonyms and
other related words (from WordNet, word families, MRC, etc.), collocates (in a
much improved format), related “topics” (perhaps much more useful than
collocates), “clusters” (new in iWeb), relevant websites, and sample
concordance/KWIC lines. Extensive hyperlinks allow you to easily and quickly
move from one word to a number of related words.
 
In addition, for each of these 60,000 words, there are “quick links” to
related data from other websites – pronunciation, additional definitions,
images, videos, and translations (for more than 100 languages).
 
iWeb also allows you to quickly and easily create “virtual corpora” on nearly
any topic, and these virtual corpora can then be searched as their own
“stand-alone” corpora, or compared to other virtual corpora that you have
created.

Finally, for “standard” corpus searches, improvements in the corpus
architecture make iWeb faster than any of the other BYU corpora, and
typically much faster than other large online corpora of 10-20 billion
words.
 
For a short overview of the corpus (in graphical format, with an emphasis on
the new features), please see:
 
https://corpus.byu.edu/iweb/help/iweb_overview.pdf
 
We hope that this new corpus is useful to you in your teaching, learning, and
research.
 
Best,
 
Mark Davies
BYU Corpora
 
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
 Corpus design and use // Linguistic databases 
 Historical linguistics // Language variation 
 English, Spanish, and Portuguese 
============================================
 



Linguistic Field(s): Computational Linguistics
                     Lexicography
                     Text/Corpus Linguistics

Subject Language(s): English (eng)

------------------------------------------------------------------------------

*****************    LINGUIST List Support    *****************
Please support the LL editors and operation with a donation at:

              The IU Foundation Crowd Funding site:
       https://iufoundation.fundly.com/the-linguist-list

               The LINGUIST List FundDrive Page:
            http://funddrive.linguistlist.org/donate/
 

