[Corpora-List] Google "region"-based searches

Mark Davies Mark_Davies at byu.edu
Wed Nov 28 13:40:00 UTC 2012


>> I don't think you can use single cases like this to make blanket
>> statements about the "total unreliability" of geolocation.

>> For example, given a large enough
>> random sample of English texts written by people whose IPs resolve to
>> Ireland, could we not reasonably expect the distribution of language
>> varieties in those texts to roughly match that of the Irish population
>> in general, or at least that portion of it which is online?

I should have a 2-3 billion word corpus of English from about 20 different countries up and running in a couple of months. It's based on Google "region-based" queries (as per my original post). Once it's done, I'll look at some linguistic features where we know that a word or phrase X is much more frequent in country Y than in other countries, and see how well the region-based searches worked. I'll try to remember to report back to CORPORA to let others know how it worked.
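
A minimal sketch of the kind of check described above, in Python. It assumes the corpus is available as plain-text documents grouped by country; the function names and the data layout are hypothetical and are not taken from the actual corpus. It computes the frequency of a word or phrase X per million tokens in each country and then tests whether the country Y where X is expected to dominate really has the highest rate.

```python
import re


def per_million(texts_by_country, phrase):
    """Return occurrences of `phrase` per million tokens for each country.

    `texts_by_country` maps a country label to a list of plain-text
    documents (a hypothetical layout, assumed for illustration).
    Tokens are counted by splitting on whitespace, which is crude but
    enough for a rough comparison.
    """
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    rates = {}
    for country, texts in texts_by_country.items():
        hits = 0
        tokens = 0
        for text in texts:
            lowered = text.lower()
            tokens += len(lowered.split())
            hits += len(pattern.findall(lowered))
        rates[country] = 1_000_000 * hits / tokens if tokens else 0.0
    return rates


def marker_confirmed(rates, expected_country):
    """True if the expected country has the highest per-million rate."""
    return max(rates, key=rates.get) == expected_country
```

If most of a set of known markers come out highest in their expected countries, that would suggest the region-based searches are doing a reasonable job; a raw "highest rate" test is only a first pass, and a real evaluation would presumably also look at how large the differences are.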

The other option, of course, is to restrict by TLD (.lk for Sri Lanka, .sg for Singapore, .tz for Tanzania, etc.), but limiting it this way *really* seems to degrade the "quality" of the web pages returned. Not as bad as if one were to limit US-based pages to .us -- where you get a lot of boring state and local government web pages -- but still not ideal. E.g., try limiting results for Tanzania to .tz or Sri Lanka to .lk -- my impression is that only a small percentage of all pages from those countries have that TLD, and those pages may not be representative of the whole.

So while geolocation certainly isn't perfect, it doesn't look like a strictly TLD-based approach would be either.

Anyway, I'll report back on what I find with the Google region-based searches.

Mark Davies

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Tristan Miller [miller at ukp.informatik.tu-darmstadt.de]
Sent: Wednesday, November 28, 2012 5:56 AM
To: Corpora List
Subject: Re: [Corpora-List] Google "region"-based searches

Greetings.

On 28/11/12 01:25 PM, Trevor Jenkins wrote:
> On 28 Nov 2012, at 11:48, Roland Schäfer <roland.schaefer at fu-berlin.de> wrote:
>
>> Whatever Google use: IP-based geolocation is totally unreliable as far
>> as language varieties are concerned.
>
> Definitely. My current ISP has various nodes connecting to the Internet.
> My connections appear to be in either Bangor in north Wales or in
> Winchester in southern England but never where I'm actually located.

I don't think you can use single cases like this to make blanket
statements about the "total unreliability" of geolocation.  Sure, the
user of any one IP can't be pinpointed with certainty to the nearest
square centimetre, but neither is geolocation totally random.  Were we
to analyze a large enough sample of geolocations, we could probably
conclude that m% of all IPs can be correctly resolved geographically to
within an n-kilometre radius.  For large enough areas (say, entire
countries) the accuracy of geolocation may be high enough for one's
purposes to make some informed estimates on the distribution of
coarse-grained language varieties.  For example, given a large enough
random sample of English texts written by people whose IPs resolve to
Ireland, could we not reasonably expect the distribution of language
varieties in those texts to roughly match that of the Irish population
in general, or at least that portion of it which is online?
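
As a rough illustration of the "m% within an n-kilometre radius" estimate, here is a minimal Python sketch. It assumes one already has a sample of IPs for which both the true location and the geolocated location are known as (latitude, longitude) pairs; how such a sample would be obtained is not addressed here, and the function names are hypothetical. Distances use the standard haversine formula with a mean Earth radius of 6371 km.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def percent_within(pairs, n_km):
    """Percentage m of (true, geolocated) pairs that lie within n_km.

    `pairs` is a list of ((true_lat, true_lon), (geo_lat, geo_lon)).
    """
    if not pairs:
        raise ValueError("need at least one (true, geolocated) pair")
    hits = sum(
        1 for true, guess in pairs
        if haversine_km(true[0], true[1], guess[0], guess[1]) <= n_km
    )
    return 100.0 * hits / len(pairs)
```

For country-level work one would compare resolved and true countries rather than distances, but the idea is the same: the estimate comes from the whole sample, not from any single case.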

Regards,
Tristan

--
Tristan Miller, Doctoral Researcher
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/


