27.2261, FYI: BYU Corpora: NOW, CORE, New Interface

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Wed May 18 15:16:07 UTC 2016


LINGUIST List: Vol-27-2261. Wed May 18 2016. ISSN: 1069 - 4875.

Subject: 27.2261, FYI: BYU Corpora: NOW, CORE, New Interface

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================


Date: Wed, 18 May 2016 11:15:56
From: Mark Davies [mark_davies at byu.edu]
Subject: BYU Corpora: NOW, CORE, New Interface

 
New from BYU corpora (http://corpus.byu.edu)

1. New corpus interface (see http://corpus.byu.edu/updates2016.asp?c=n)

The new interface is much more mobile-friendly (smartphones and tablets). It
has a cleaner, simpler layout; more helpful "context-sensitive" help files;
and a simpler, more intuitive search syntax (e.g. EAT * NOUN = eat the cake,
ate some strawberries; =expensive @CLOTHES = pricey coat, classy jeans).
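
To make the first syntax example concrete, here is a minimal Python sketch of
how a pattern such as EAT * NOUN could be matched against tagged text. It is
an illustration only, not the corpus's actual implementation: the (word,
lemma, part-of-speech) token format, the POS_TAGS set, and the function names
are all assumptions made for this example. The synonym (=expensive) and
word-list (@CLOTHES) operators are not covered.

```python
# Hypothetical tag set for this sketch; the real corpus tag set may differ.
POS_TAGS = {"NOUN", "VERB", "ADJ"}

def element_matches(element, token):
    """Check one pattern element against one (word, lemma, pos) token."""
    word, lemma, pos = token
    if element == "*":                # wildcard: any single word
        return True
    if element in POS_TAGS:           # part-of-speech slot, e.g. NOUN
        return pos == element
    if element.isupper():             # capitalized form: match any form of the lemma
        return lemma == element.lower()
    return word.lower() == element.lower()   # otherwise an exact word

def find_matches(pattern, tokens):
    """Return every token sequence that matches the space-separated pattern."""
    elements = pattern.split()
    n = len(elements)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(element_matches(e, t) for e, t in zip(elements, window)):
            hits.append(" ".join(word for word, _, _ in window))
    return hits

# Toy tagged sentence (invented for illustration).
tagged = [
    ("She", "she", "PRON"), ("ate", "eat", "VERB"), ("some", "some", "DET"),
    ("strawberries", "strawberry", "NOUN"), ("and", "and", "CONJ"),
    ("will", "will", "VERB"), ("eat", "eat", "VERB"), ("the", "the", "DET"),
    ("cake", "cake", "NOUN"),
]
hits = find_matches("EAT * NOUN", tagged)
print(hits)
```

On this toy input the pattern picks up both "ate some strawberries" and "eat
the cake", because EAT is compared against the lemma rather than the surface
form.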

The new interface also allows users to quickly and easily create and use
virtual corpora, such as texts from Cosmopolitan or Astronomy magazines
(COCA), texts dealing with the New Deal from 1932-1938 (COHA), or newspaper
articles from September 2015 dealing with the European refugee crisis (NOW).
Users can search within the virtual corpora, compare the frequency of words,
phrases, and constructions across their different virtual corpora, and quickly
and easily extract keywords from a virtual corpus.

2. NOW corpus (http://corpus.byu.edu/now). A corpus of nearly three billion
words, covering 2010 through ... yesterday. Approximately 4 million words /
10,000 articles are added to the corpus every day (~125 million words a month,
1.5 billion words a year), which means that you're not limited to corpus
results from several years (or even decades) ago.
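
As a quick sanity check on those growth figures, the following Python sketch
multiplies out the daily rate. The 30-day month and 365-day year are
simplifying assumptions for this example, and the function name is
hypothetical.

```python
WORDS_PER_DAY = 4_000_000  # approximate daily figure quoted above

def projected_growth(words_per_day):
    """Project monthly and yearly growth from a daily rate.

    Assumes a 30-day month and a 365-day year.
    """
    return {
        "per_month": words_per_day * 30,
        "per_year": words_per_day * 365,
    }

growth = projected_growth(WORDS_PER_DAY)
print(growth)
```

A flat 4 million words a day gives 120 million words a month and 1.46 billion
a year, which is in line with the rounded figures of ~125 million and 1.5
billion.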

The following are just a few examples of what the corpus can do. Click on the
"tour" icon at the top of the page for many more examples.

With such an up-to-date corpus, you can look for neologisms like fracklog,
swatting, mommy porn, catfishing, trigger warning, and nomophobia; find words
occurring with digital NOUN or data NOUN; or substrings, such as *fest,
*sexual*, *phobia, *alypse, *geddon, or *ware (with frequency of words by year
or month).
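
The substring searches can be imitated offline on any word list. The sketch
below uses Python's standard fnmatch module, whose * wildcard behaves like
the one in the examples above; the token list is invented for illustration,
and this is not how the corpus itself runs the query.

```python
from collections import Counter
from fnmatch import fnmatchcase

def match_wildcard(tokens, pattern):
    """Count lowercased tokens that match a shell-style wildcard pattern."""
    lowered = (tok.lower() for tok in tokens)
    return Counter(tok for tok in lowered if fnmatchcase(tok, pattern))

# Invented sample tokens, not taken from the corpus.
tokens = [
    "Oktoberfest", "snowpocalypse", "nomophobia", "carmageddon",
    "malware", "Ransomware", "malware", "festival",
]

fest = match_wildcard(tokens, "*fest")
ware = match_wildcard(tokens, "*ware")
print(fest)
print(ware)
```

Note that "festival" does not match *fest, because the pattern requires the
word to end in "fest".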

You can also see the frequency of words and phrases by "week", to see when a
particular topic was discussed the most since 2010 (for example: Paris
attacks, Ashley Madison, or tsunami). You can also find the keywords for a
given day (including yesterday), or by week, month, or year. For example, you
can find the keywords for Apr 4 2016 (Panama papers: offshore, taxes) or Mar
22 2016 (Brussels airport: bomb, terrorists).
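
For readers who want to reproduce a frequency-by-week view on their own dated
texts, here is a minimal Python sketch. It assumes articles are available as
(date, text) pairs and groups them by ISO year and week; the sample articles
and the function name are invented for this example, and the corpus may
define its weeks differently.

```python
from collections import Counter
from datetime import date

def weekly_frequency(articles, phrase):
    """Count occurrences of a phrase per (ISO year, ISO week)."""
    counts = Counter()
    needle = phrase.lower()
    for day, text in articles:
        iso = day.isocalendar()
        year, week = iso[0], iso[1]
        counts[(year, week)] += text.lower().count(needle)
    return counts

# Invented sample articles for illustration.
articles = [
    (date(2015, 11, 14), "Paris attacks: world reacts. The Paris attacks shocked Europe."),
    (date(2015, 11, 15), "Leaders condemn the Paris attacks."),
    (date(2015, 11, 17), "Investigation into Paris attacks continues."),
    (date(2015, 11, 18), "Weather report for Paris."),
]

counts = weekly_frequency(articles, "Paris attacks")
for (year, week), n in sorted(counts.items()):
    print(year, week, n)
```

Sorting the resulting counts by value would then show the week in which the
phrase was discussed the most.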

It is also possible to compare across different "sections" of the corpus,
by either time or country. For example: *gate (2015-2016 vs 2010-2011:
deflategate, dieselgate, etc.), data NOUN (2015-2016 vs 2010-2011: data lake,
data grid), or ADJ collocates of Obama (2015-2016 vs 2010-2011).

Finally, you can quickly and easily create and then use "virtual corpora".
For example, in just 5-10 seconds you could create a million-word corpus based
on texts from September 2015 dealing with refugees in Europe. As discussed
above in #1, you can then search within the virtual corpora, compare the
frequency of words and phrases across different virtual corpora, and generate
"keyword" lists from a virtual corpus (e.g. asylum, war-torn, resettle).
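
This announcement does not say how keyword lists are computed. One common
approach is to compare a word's relative frequency in the target texts with
its relative frequency in a reference corpus. The Python sketch below uses a
plain frequency ratio with add-one smoothing, purely as an illustration; the
function name, parameters, and sample sentences are assumptions, and the
corpus may well use a different statistic.

```python
from collections import Counter

def keywords(target_tokens, reference_tokens, top_n=3, min_count=2):
    """Rank words by how much more frequent they are in target than reference."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    scores = {}
    for word, count in target.items():
        if count < min_count:         # skip words too rare to judge
            continue
        target_rate = count / target_total
        # Add-one smoothing so words absent from the reference do not divide by zero.
        reference_rate = (reference[word] + 1) / (reference_total + 1)
        scores[word] = target_rate / reference_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Invented sample texts for illustration.
target = ("asylum seekers fled the war-torn region and asylum claims rose "
          "as the war-torn towns emptied").split()
reference = "the market rose and the team won as the season ended".split()

top = keywords(target, reference, top_n=2)
print(top)
```

Here "asylum" and "war-torn" rank above "the", which is frequent in both
texts and therefore not distinctive of the target.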

3. CORE corpus (http://corpus.byu.edu/core). This corpus results from a grant
from the US National Science Foundation to Douglas Biber, Mark Davies, and
Jesse Egbert for "A Linguistic Taxonomy of English Web Registers".
The corpus contains more than 50 million words of text from the web, and it
carefully categorizes the 50,000 texts into 30+ different web registers
(personal blogs, interviews, "description with intent to sell", "how-to"
pages, sports reporting, etc.). This is quite different from other very large
corpora that simply present huge amounts of data from web pages as giant
"blobs", with no real attempt to categorize them into linguistically
distinct registers.

We hope that these new resources will be of value to you in your research and
teaching.

Best,

Mark Davies
BYU Corpora
--
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
 



Linguistic Field(s): Computational Linguistics
                     Lexicography
                     Text/Corpus Linguistics

Subject Language(s): English (eng)

Language Family(ies): Indo-European





 


