[Corpora-List] Google "region"-based searches (and BootCat)

Mark Davies Mark_Davies at byu.edu
Wed Nov 28 13:15:31 UTC 2012


>> By the way, if you want to use only Google results without crawling (BootCaT approach), you will have to pay substantial amounts of money, because they don't allow free API bulk requests anymore.

I've been running high-frequency COCA 3-grams against Google for the last week or so, to create a 2-3 billion word corpus, and I've collected a bit more than 2,000,000 URLs.

Google does ask you to solve a CAPTCHA every 3-4 hours, but as long as that's not a problem (I have my program email me as soon as it gets redirected to that page), it works fairly well.
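
(For anyone wanting to try something similar, here is a rough Python sketch of what such a setup could look like: one query per 3-gram against the plain result pages, a check for the redirect to Google's CAPTCHA page, and an email notification when that happens. The addresses, the local SMTP server, the /sorry/ test, and the crude link-extraction regex are placeholders only; Google's markup and blocking behaviour change without notice.)

    import re
    import time
    import smtplib
    import urllib.parse
    from email.message import EmailMessage

    import requests

    SEARCH_URL = "https://www.google.com/search"  # plain result pages, not an API
    NOTIFY_FROM = "crawler@example.org"           # placeholder addresses
    NOTIFY_TO = "me@example.org"
    SMTP_HOST = "localhost"                       # assumes a local mail server

    def notify_captcha():
        """Send an email as soon as Google redirects to its CAPTCHA page."""
        msg = EmailMessage()
        msg["Subject"] = "Google is asking for a CAPTCHA"
        msg["From"] = NOTIFY_FROM
        msg["To"] = NOTIFY_TO
        msg.set_content("Solve the CAPTCHA in a browser, then resume the run.")
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)

    def collect_urls(ngram, pause=10.0):
        """Return result URLs for one 3-gram query, or None if Google blocks us."""
        resp = requests.get(
            SEARCH_URL,
            params={"q": f'"{ngram}"'},
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30,
        )
        # Google typically signals blocking with a redirect to /sorry/ or a 429/503.
        if "/sorry/" in resp.url or resp.status_code in (429, 503):
            notify_captcha()
            return None
        # Crude link extraction; Google's result markup changes frequently.
        urls = re.findall(r'href="(https?://[^"]+)"', resp.text)
        urls = [u for u in urls
                if "google." not in urllib.parse.urlparse(u).netloc]
        time.sleep(pause)  # pace the queries to postpone the CAPTCHA page
        return urls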

BTW, on BootCaT, I was under the impression that they were having trouble finding a search engine that allowed a sufficient number of queries (see http://listserv.linguistlist.org/cgi-bin/wa?A2=ind1204&L=CORPORA&P=R12047). Has this been solved? It looks like the limit is about 5,000 queries per month (see http://listserv.linguistlist.org/cgi-bin/wa?A2=ind1207&L=CORPORA&P=R455).

Anyway, with a bit of effort it is possible to (at least partially) circumvent the Google limits, to get several million URLs per month -- if that's the route one wants to go.

MD

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Roland Schäfer [roland.schaefer at fu-berlin.de]
Sent: Wednesday, November 28, 2012 4:48 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] Google "region"-based searches

I totally agree with what Adam Kilgarriff said: The problem is that
nobody would want to do research on the accuracy of any Google feature,
because they can change their algorithms at any moment without notice
and without documentation. Results are fundamentally invalid even before
you produce them.

By the way, if you want to use only Google results without crawling
(BootCaT approach), you will have to pay substantial amounts of money,
because they don't allow free API bulk requests anymore.

Whatever method Google uses: IP-based geolocation is totally unreliable as far
as language varieties are concerned. If you find a document from a
server located in Liverpool, are you going to treat the document as
necessarily (or even potentially) containing Scouse features? Also,
servers deliver different content based on undocumented mixes of various
headers sent by the requester, requester IP geolocation, etc. Thus, a
server located in London may deliver specialized content for US
visitors, potentially written for US visitors by US authors. Google or
any geolocator might even have classified the region of origin for some
document correctly, but your crawler gets a different redirect to a
different IP address. Automatic methods for large amounts of data will
most likely never deliver reliable region identification.
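
As a quick illustration of the header problem, a rough Python sketch like the one below (the URL is just a placeholder) fetches the same page with different Accept-Language values and compares what comes back. The requester IP cannot be varied from a single machine this way, but the header alone is often enough to produce a different page or a different redirect:

    import hashlib
    import requests

    URL = "https://www.example.com/"  # placeholder; any multi-regional site will do

    def fetch_digest(accept_language):
        """Fetch URL with one Accept-Language value; return final URL and a body hash."""
        resp = requests.get(
            URL,
            headers={"Accept-Language": accept_language,
                     "User-Agent": "Mozilla/5.0"},
            timeout=30,
        )
        return resp.url, hashlib.sha1(resp.content).hexdigest()

    # Differing hashes (or final URLs) mean the crawler's copy need not match
    # what a visitor from the document's "region" actually sees.
    for lang in ("en-US", "en-GB", "de-DE"):
        final_url, digest = fetch_digest(lang)
        print(lang, final_url, digest)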

If you want to deal with regional varieties in web corpora, I think two
possible routes are: (1) Go for a small gold standard web corpus and try
to figure out the variety spoken by the writers manually for each
document. (2) Do more or less unselective crawls in the English-speaking
web and then see whether the documents you get look like what you
already know to be BrE and AmE, etc. Actually, top-level domains might turn out to be sort of reliable in some cases (with an emphasis on "sort of"; a rough sketch of such a TLD filter follows the reference below), cf., e.g.:

@INPROCEEDINGS{Cook-Hirst2012,
  author = {Cook, Paul and Hirst, Graeme},
  title = {Do Web-Corpora from Top-Level Domains Represent National
    Varieties of {English}?},
  booktitle = {Proceedings of the 11th International Conference on the
    Statistical Analysis of Textual Data},
  year = {2012},
  address = {Liège},
}
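
As a first, very rough filter along the lines of route (2), something like the following Python sketch (the TLD-to-variety table is purely illustrative) buckets crawled URLs by national top-level domain; note that generic domains such as .com give no signal at all:

    from urllib.parse import urlparse

    # Illustrative mapping only; the paper above examines how far the heuristic carries.
    TLD_TO_VARIETY = {
        "uk": "BrE", "ie": "IrE", "us": "AmE",
        "ca": "CanE", "au": "AusE", "nz": "NZE",
    }

    def guess_variety(url):
        """Guess a national variety from the top-level domain, or None if no signal."""
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        return TLD_TO_VARIETY.get(tld)

    for u in ("http://www.bbc.co.uk/news",
              "http://www.smh.com.au/",
              "http://example.com/blog"):       # .com: no regional signal
        print(u, "->", guess_variety(u))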

Regards,
Roland

28.11.2012 11:16, corpora-request at uib.no wrote:
> Message: 3
> Date: Tue, 27 Nov 2012 14:34:10 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: [Corpora-List] Google "region"-based searches
> To: "corpora at hd.uib.no" <corpora at hd.uib.no>
>
> I'm looking at creating a corpus based on the web pages from a particular country, and I'd like to use Google's advanced search "region" field to limit the pages (https://www.google.com/advanced_search, see http://www.googleguide.com/sharpening_queries.html#region). Supposedly, this limits pages based on IP address, rather than just TLD (such as .sg or .sk).
>
> Has anyone heard how accurate this region field is? I'm wondering, because I'm seeing links to (for example) *.blogspot.com for region-based searches from countries other than the US (e.g. Singapore or Sri Lanka). In order for Google to be accurate in these cases, presumably there are servers for blogspot.com in these other countries (or any other domain), and as people from those countries create blogs they are stored on servers in that country, and then Google is recognizing their location by IP address, rather than just the domain. And the same would hold true for any US or UK-based domain that would return results from other countries.
>
> Thanks in advance,
>
> Mark Davies

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


