26.584, FYI: Wikipedia Corpus (1.9 bil. words; virtual corpora)

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Wed Jan 28 16:01:12 UTC 2015


LINGUIST List: Vol-26-584. Wed Jan 28 2015. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Uliana Kazagasheva <uliana at linguistlist.org>
================================================================


Date: Wed, 28 Jan 2015 11:00:28
From: Mark Davies [mark_davies at byu.edu]
Subject: Wikipedia Corpus (1.9 bil. words; virtual corpora)

 
New BYU Wikipedia Corpus (with virtual corpora):
http://corpus.byu.edu/wiki/

Overview and YouTube tutorials:
http://corpus.byu.edu/wikipedia.asp

We have recently released the BYU (Brigham Young University) Wikipedia
Corpus, which is composed of 1.9 billion words in 4.4 million articles. With
this new corpus, you can now search Wikipedia in all of the ways that you can
search the other corpora from BYU (http://corpus.byu.edu) - word and phrase,
part of speech, variable strings, synonyms, comparisons of words, collocates,
and concordance lines.

Most importantly, however, with this interface you can quickly and easily
create and then search personalized "virtual corpora" from the 4,400,000 web
pages. For example, in just a few seconds you could create a corpus with
500-1,000 pages (perhaps 500,000-1,000,000 words) related to microbiology,
economics, basketball, Buddhism, or thousands of other topics. You can also
modify any of these corpora - adding, deleting, or moving texts; creating
groups of corpora, etc.

You can then limit your search to just that portion of Wikipedia, to see
collocates or concordance lines from just that virtual corpus. You can also
compare the frequency of words and phrases across these different virtual
corpora, or find which of the 4.4 million pages use a given word or phrase the
most (and then create a customized corpus from those results).

And perhaps best of all, you can quickly and easily create keyword lists for
these corpora, including multi-word expressions. So if you are studying or
teaching finance, for example, you can quickly create a customized "finance"
corpus, and then find keywords (e.g. nouns, verbs, adjectives, noun+noun)
related to this topic. And you can then see many examples of these words or
phrases in context.
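
For readers who want a rough sense of how a keyword list can be derived from a
topic corpus, here is a minimal sketch in Python. It is not the method used by
the BYU interface, which this announcement does not describe; it simply ranks
words by the log ratio of their relative frequency in a topic corpus versus a
reference corpus. The function names, the smoothing choice, and the tiny
sample texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def keywords(target_texts, reference_texts, top_n=5, min_count=2):
    """Rank target-corpus words by the log ratio of relative frequencies.

    Add-one smoothing (an assumption of this sketch) keeps words that are
    absent from the reference corpus from causing a division by zero.
    """
    target = Counter(t for text in target_texts for t in tokenize(text))
    reference = Counter(t for text in reference_texts for t in tokenize(text))
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    vocab_size = len(set(target) | set(reference))

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip words too rare in the topic corpus to be reliable
        p_target = (count + 1) / (target_total + vocab_size)
        p_reference = (reference[word] + 1) / (reference_total + vocab_size)
        scores[word] = math.log2(p_target / p_reference)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Hypothetical stand-ins for a "finance" virtual corpus and a general one.
finance_pages = [
    "The bank raised interest rates and bond yields fell.",
    "Investors sold the bond after interest rates rose.",
]
general_pages = [
    "The team won the game and the crowd cheered.",
    "The river runs through the valley near the town.",
]

print(keywords(finance_pages, general_pages, top_n=3))
# prints ['bond', 'interest', 'rates']
```

On real data the two lists would hold the full text of the pages in each
corpus, and a higher min_count would filter out noise from rare words.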

So rather than having to scour the Web to find web pages for a corpus on a
given topic, you can now just create a corpus from the relevant pages in
Wikipedia, and then use the data from the new Wikipedia corpus to focus on
the words and phrases of that particular topic.

We hope that this new corpus is of use to you in your teaching and research.
 



Linguistic Field(s): Lexicography
                     Text/Corpus Linguistics

Subject Language(s): English (eng)





 





