[Corpora-List] N-grams as seed for web-based corpora

Mark Davies Mark_Davies at byu.edu
Thu Jun 20 18:32:08 UTC 2013


It seems that there are at least two ways to "seed" Google/Bing/etc. searches when creating web-based corpora. One is to seed the searches with words from a particular domain (e.g. engineering or biology). Another option is simply to run high-frequency n-grams (e.g. from the Google n-grams or http://www.ngrams.info) against the search engines (e.g. "and from the", "but it is"), in the hope/expectation that these queries will return essentially "random" results from the search engine.
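
Just to make the second approach concrete, here is a minimal sketch of what the n-gram seeding step might look like. The endpoint, API key, and JSON field names below are purely illustrative assumptions -- any real engine (Bing, Google, etc.) has its own authentication and response format -- but the basic loop (quote each high-frequency n-gram, query the engine, pool the returned URLs for later crawling) is the idea being described.

    # Sketch only: SEARCH_API_URL, API_KEY, and the "results"/"url" JSON
    # fields are hypothetical placeholders, not a real search engine API.
    import requests

    SEARCH_API_URL = "https://api.example.com/search"   # hypothetical endpoint
    API_KEY = "YOUR_KEY"                                 # hypothetical credential

    # High-frequency 3-grams used as "random" seeds (cf. http://www.ngrams.info)
    SEED_NGRAMS = ['"and from the"', '"but it is"', '"as well as"']

    def collect_seed_urls(ngrams, per_query=50):
        """Query the engine with each quoted n-gram and pool the result URLs."""
        urls = set()
        for ngram in ngrams:
            resp = requests.get(
                SEARCH_API_URL,
                params={"q": ngram, "count": per_query, "key": API_KEY},
                timeout=30,
            )
            resp.raise_for_status()
            for hit in resp.json().get("results", []):
                urls.add(hit["url"])
        return urls

    if __name__ == "__main__":
        print(f"Collected {len(collect_seed_urls(SEED_NGRAMS))} candidate page URLs")

The pooled URLs would then be downloaded and cleaned to form the actual corpus texts; the seeding step itself is only about getting an (approximately) unbiased sample of pages out of the search engine.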

This second approach (the "random" results via n-grams) is the one we used in constructing the new 1.9-billion-word Corpus of Global Web-based English (GloWbE; http://corpus2.byu.edu/glowbe/), which is tangentially related to an NSF-funded project on web genres that Doug Biber and I are doing.

As far as I know, this is also the approach used in creating the Sketch Engine corpora (right, Adam?). Besides those, however, I'm also interested in other web-based corpora that are based on this approach -- actual corpora (hopefully publicly accessible), and not just conference papers talking about how, in theory, this could be done.

Thanks in advance,

Mark D.

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

