[Corpora-List] N-grams as seed for web-based corpora

egon w. stemle egon.stemle at unitn.it
Fri Jun 21 16:17:39 UTC 2013


hi mark,

the Italian PAISA corpus was partly constructed like this - and it is 
available at http://www.corpusitaliano.it/en/

"""
The PAISA documents were selected in two different ways. A part of the 
corpus was constructed using a method inspired by the WaCky project. We 
created 50,000 word pairs by randomly combining terms from an Italian
basic vocabulary list (http://ppbm.paravia.it/dib_lemmario.php),
and used the pairs as queries to the Yahoo! search engine
(http://developer.yahoo.com/boss/) in order to retrieve candidate
pages. We limited hits to pages in Italian with a Creative Commons 
license of type: CC-Attribution, CC-Attribution-Sharealike, 
CC-Attribution-Sharealike-Non-commercial, and 
CC-Attribution-Non-commercial. Pages that were wrongly tagged as 
CC-licensed were eliminated using a black-list that was populated by 
manual inspection of earlier versions of the corpus. The retrieved pages 
were automatically cleaned using the KrdWrd system
(https://krdwrd.org/).

The remaining pages in the PAISA corpus come from the Italian versions 
of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, 
Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia 
Foundation dumps were used, extracting text with Wikipedia Extractor 
(http://medialab.di.unipi.it/wiki/Wikipedia_Extractor).

Once all materials were downloaded, the collection was filtered,
discarding empty documents and documents containing fewer than 150 words.

The corpus contains approximately 390,000 documents from about
1,000 different websites, for a total of about 250 million words.
Approximately 260,000 documents are from Wikipedia and approximately
5,600 from other Wikimedia Foundation projects. About 9,300 documents come from
Indymedia, and we estimate that about 65,000 documents come from blog 
services.

Documents are marked in the corpus by an XML "text" tag with "id" and
"url" attributes, the first corresponding to a unique numeric code
assigned to each document and the second providing the original URL of
the document.
"""

-e.

On 06/21/2013 12:00 PM, corpora-request at uib.no wrote:
> Date: Thu, 20 Jun 2013 18:32:08 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: [Corpora-List] N-grams as seed for web-based corpora
> To: "corpora at uib.no" <corpora at uib.no>
>
> It seems that there are at least two ways to "seed" the Google/Bing/etc searches to create web-based corpora. One is to seed it with words from a particular domain (e.g. engineering or biology). Another option is to just run highly-frequent n-grams (e.g. Google n-grams or http://www.ngrams.info) against the search engines (e.g. "and from the", "but it is"), in the hope / expectation that these queries will return essentially "random" results from the search engine.
>
> This second approach (the "random" results via n-grams) is the approach that we used in the construction of the new 1.9 billion word Corpus of Global Web-based English (GloWbE; http://corpus2.byu.edu/glowbe/), which is tangentially related to an NSF-funded project on web genres that Doug Biber and I are doing.
>
> As far as I know, this approach is also the one used in the creation of the Sketch Engine corpora (right, Adam?). Besides those corpora, however, I'm also interested in other web-based corpora that are based on this approach -- actual corpora (hopefully publicly accessible), and not just conference papers talking about how -- in theory -- this could be done.
>
> Thanks in advance,
>
> Mark D.

