27.5018, FYI: Full-text Corpus Data: NOW, Wikipedia, Spanish

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Thu Dec 8 13:05:17 EST 2016


LINGUIST List: Vol-27-5018. Thu Dec 08 2016. ISSN: 1069 - 4875.

Subject: 27.5018, FYI: Full-text Corpus Data: NOW, Wikipedia, Spanish

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Yue Chen <yue at linguistlist.org>
================================================================


Date: Thu, 08 Dec 2016 13:04:57
From: Mark Davies [mark_davies at byu.edu]
Subject: Full-text Corpus Data: NOW, Wikipedia, Spanish

 
We have just released four new downloadable full-text datasets from the BYU
corpora:

http://corpus.byu.edu/full-text/

These join the full-text data from COCA, COHA, and GloWbE, which have been
available for the last two years.

- Full-text data from the NOW corpus, which has 3.6 billion words of data,
and which is growing by 4-5 million words each day (130 million words each
month, 1.5 billion words each year).
- Full-text data from the Wikipedia corpus, which has 1.9 billion words in
4.4 million articles, and which can be used to extract data on a wide variety
of topics.
- Update for the COCA data. The update has 70 million words from 2012-2015
(for a total of 520 million words in COCA), from the spoken, fiction,
magazine, newspaper, and academic genres.
- Full-text data from the Corpus del Español (Web / Dialects), which has 2.0
billion words from 21 countries (60% of it informal blogs).

In all cases, the data for the downloadable corpora is available in three
different formats: basic linear text, word/lemma/PoS, and relational database,
and it contains more than 95% of the data from the online corpora.

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
 



Linguistic Field(s): Computational Linguistics
                     Lexicography
                     Text/Corpus Linguistics

Subject Language(s): English (eng)
                     Spanish (spa)





 


