[Corpora-List] Google "region"-based searches

egon w. stemle egon.stemle at unitn.it
Wed Nov 28 14:17:20 UTC 2012


...not sure what you want to use the 'regional' feature for, but if it
is what i think it is, the following work might be of interest:

http://dl.acm.org/citation.cfm?id=2140536
Paddy WaC: a minimally-supervised web-corpus of Hiberno-English

http://www.cs.toronto.edu/~pcook/CookHirst2012.pdf
Do Web Corpora from Top-Level Domains Represent National Varieties of 
English?

http://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf#page=31
Using Web Corpora for the Recognition of Regional Variation in Standard 
German Collocations

...and there is an upcoming/unpublished work about:
StirWac - Compiling a web-based diverse corpus for South Tyrolean German 
considering genre

"""
...describe how we compiled a web-based corpus for South Tyrolean German
and afterwards proceeded with a method trying to make it more diverse.

During the compilation of the corpus we had to face the problem that the
variety of our specialized corpus is not limited to one top-level domain
on the internet. Therefore we had to develop new strategies to narrow
down our area of search. We based our work on the BootCaT tool by
\citet{BaroniB04a} and used the web crawler Nutch developed by Apache
additionally. After the compilation of the corpus we analysed its
document distribution with a method suggested by Serge Sharoff and tried
to increase the 'weakly represented areas' with similar documents again
gained from the internet.
"""
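
to illustrate the top-level-domain problem mentioned in the abstract, here
is a minimal sketch in Python. it is not the StirWaC pipeline and not how
Google's region field works; the function names and the example URLs are
made up for illustration. it only shows that a filter on a country-code
TLD silently drops pages hosted under generic or neighbouring domains:

```python
from urllib.parse import urlsplit


def tld_of(url):
    """Return the last label of the host name, lower-cased (e.g. 'it', 'com')."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def split_by_tld(urls, wanted):
    """Partition urls into those whose TLD is in `wanted` and all the rest."""
    kept, rest = [], []
    for url in urls:
        (kept if tld_of(url) in wanted else rest).append(url)
    return kept, rest


# Hypothetical URLs, invented for this example only.
urls = [
    "http://www.example.it/seite1.html",
    "http://blog.example.com/suedtirol/",
    "http://example.at/news",
]

kept, rest = split_by_tld(urls, {"it"})
print("kept:", kept)
print("dropped by the TLD filter:", rest)
```

everything that ends up in `rest` would need some other signal (seed
terms, language identification, server location, ...) to decide whether it
belongs to the regional variety.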

if you find anything interesting, i'm happy to go into more details (at 
least with my own work...) -e.


On 2012-11-28 11:16, corpora-request at uib.no wrote:
> Date: Tue, 27 Nov 2012 14:34:10 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: [Corpora-List] Google "region"-based searches
> To: "corpora at hd.uib.no" <corpora at hd.uib.no>
>
> I'm looking at creating a corpus based on the web pages from a particular country, and I'd like to use Google's advanced search "region" field to limit the pages (https://www.google.com/advanced_search, see http://www.googleguide.com/sharpening_queries.html#region). Supposedly, this limits pages based on IP address, rather than just TLD (such as .sg or .sk).
>
> Has anyone heard how accurate this region field is? I'm wondering, because I'm seeing links to (for example) *.blogspot.com for region-based searches from countries other than the US (e.g. Singapore or Sri Lanka). In order for Google to be accurate in these cases, presumably there are servers for blogspot.com in these other countries (or any other domain), and as people from those countries create blogs they are stored on servers in that country, and then Google is recognizing their location by IP address, rather than just the domain. And the same would hold true for any US or UK-based domain that would return results from other countries.
>
> Thanks in advance,
>
> Mark Davies
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================




