[Corpora-List] N-grams as seed for web-based corpora
egon w. stemle
egon.stemle at unitn.it
Fri Jun 21 16:17:39 UTC 2013
hi mark,
the Italian PAISA corpus was partly constructed like this - and it is
available at http://www.corpusitaliano.it/en/
"""
The PAISA documents were selected in two different ways. A part of the
corpus was constructed using a method inspired by the WaCky project. We
created 50,000 word pairs by randomly combining terms from an Italian
basic vocabulary list (http://ppbm.paravia.it/dib_lemmario.php),
and used the pairs as queries to the Yahoo! search engine
(http://developer.yahoo.com/boss/) in order to retrieve candidate
pages. We limited hits to pages in Italian with a Creative Commons
license of type: CC-Attribution, CC-Attribution-Sharealike,
CC-Attribution-Sharealike-Non-commercial, and
CC-Attribution-Non-commercial. Pages that were wrongly tagged as
CC-licensed were eliminated using a black-list that was populated by
manual inspection of earlier versions of the corpus. The retrieved pages
were automatically cleaned using the KrdWrd system
(https://krdwrd.org/).
The remaining pages in the PAISA corpus come from the Italian versions
of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews,
Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia
Foundation dumps were used, extracting text with Wikipedia Extractor
(http://medialab.di.unipi.it/wiki/Wikipedia_Extractor).
Once all materials were downloaded, the collection was filtered,
discarding empty documents and documents containing fewer than 150 words.
The corpus contains approximately 390,000 documents coming from about
1,000 different websites, for a total of about 250 million words.
Approximately 260,000 documents are from Wikipedia, approx. 5,600 from
other Wikimedia Foundation projects. About 9,300 documents come from
Indymedia, and we estimate that about 65,000 documents come from blog
services.
Documents are marked in the corpus by an XML "text" tag with "id" and
"url" attributes, the first corresponding to a unique numeric code
assigned to each document, and the second providing the original URL of the
document.
"""
-e.
On 06/21/2013 12:00 PM, corpora-request at uib.no wrote:
> Date: Thu, 20 Jun 2013 18:32:08 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: [Corpora-List] N-grams as seed for web-based corpora
> To: "corpora at uib.no" <corpora at uib.no>
>
> It seems that there are at least two ways to "seed" the Google/Bing/etc searches to create web-based corpora. One is to seed it with words from a particular domain (e.g. engineering or biology). Another option is to just run highly-frequent n-grams (e.g. Google n-grams or http://www.ngrams.info) against the search engines (e.g. "and from the", "but it is"), in the hope / expectation that these queries will return essentially "random" results from the search engine.
>
> This second approach (the "random" results via n-grams) is the approach that we used in the construction of the new 1.9 billion word Corpus of Global Web-based English (GloWbE; http://corpus2.byu.edu/glowbe/), which is tangentially related to an NSF-funded project on web genres that Doug Biber and I are doing.
>
> As far as I know, this approach is also the one used in the creation of the Sketch Engine corpora (right, Adam?). Besides those corpora, however, I'm also interested in other web-based corpora that are based on this approach -- actual corpora (hopefully publicly accessible), and not just conference papers talking about how -- in theory -- this could be done.
>
> Thanks in advance,
>
> Mark D.