27.5018, FYI: Full-text Corpus Data: NOW, Wikipedia, Spanish
The LINGUIST List via LINGUIST
linguist at listserv.linguistlist.org
Thu Dec 8 18:05:17 UTC 2016
LINGUIST List: Vol-27-5018. Thu Dec 08 2016. ISSN: 1069-4875.
Subject: 27.5018, FYI: Full-text Corpus Data: NOW, Wikipedia, Spanish
Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
Michael Czerniakowski)
Homepage: http://linguistlist.org
***************** LINGUIST List Support *****************
Fund Drive 2016
25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
http://funddrive.linguistlist.org/donate/
Editor for this issue: Yue Chen <yue at linguistlist.org>
================================================================
Date: Thu, 08 Dec 2016 13:04:57
From: Mark Davies [mark_davies at byu.edu]
Subject: Full-text Corpus Data: NOW, Wikipedia, Spanish
We have just released four new downloadable full-text datasets from the BYU
corpora:
http://corpus.byu.edu/full-text/
These join the full-text data from COCA, COHA, and GloWbE, which have been
available for the last two years.
- Full-text data from the NOW corpus, which has 3.6 billion words of data,
and which is growing by 4-5 million words each day (130 million words each
month, 1.5 billion words each year)
- Full-text data from the Wikipedia corpus, which has 1.9 billion words in
4.4 million articles, and which can be used to extract data on a wide variety
of topics
- Update for the COCA data. The update has 70 million words from 2012-2015
(for a total of 520 million words in COCA), drawn from the spoken, fiction,
magazine, newspaper, and academic genres.
- Full-text data from the Corpus del Español (Web / Dialects), which has 2.0
billion words from 21 countries (60% of it informal blogs).
In all cases, the downloadable data is available in three different formats:
basic linear text, word/lemma/PoS, and relational database. It contains more
than 95% of the data from the corresponding online corpora.
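
As a rough illustration of how the word/lemma/PoS format might be used, here is
a minimal Python sketch that reads such data and counts lemma frequencies. It
assumes one token per line with three tab-separated columns (word, lemma, PoS).
That layout, the sample tokens, the tag values, and the names Token, parse_wlp,
and lemma_counts are all hypothetical and chosen only for illustration. The
actual column layout of the downloads may differ, so check the documentation at
the URL above before relying on it.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


def parse_wlp(lines: Iterable[str]) -> list[Token]:
    """Parse tab-separated word/lemma/PoS lines, skipping malformed rows."""
    tokens = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank lines or rows with an unexpected column count
        tokens.append(Token(*fields))
    return tokens


def lemma_counts(tokens: Iterable[Token]) -> Counter:
    """Count lemma frequencies, ignoring case."""
    return Counter(tok.lemma.lower() for tok in tokens)


# Hypothetical sample in the ASSUMED three-column layout (not real corpus data;
# the PoS tags are placeholders).
SAMPLE = "The\tthe\tat\ncorpora\tcorpus\tnn2\ngrow\tgrow\tvv0\n\nCorpus\tcorpus\tnn1\n"

tokens = parse_wlp(SAMPLE.splitlines())
print(lemma_counts(tokens).most_common(1))  # [('corpus', 2)]
```

For the multi-billion-word corpora, the same idea would be better applied while
streaming over an open file rather than building a list of all tokens in memory.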
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
Linguistic Field(s): Computational Linguistics
Lexicography
Text/Corpus Linguistics
Subject Language(s): English (eng)
Spanish (spa)
------------------------------------------------------------------------------