[Corpora-List] Google "region"-based searches

Mike Maxwell maxwell at umiacs.umd.edu
Wed Nov 28 22:58:51 UTC 2012


On 11/28/2012 8:40 AM, Mark Davies wrote:
> The other option, of course, is to use TLD (.lk for Sri Lanka, .sg for Singapore, .tz for
> Tanzania, etc), but limiting it this way *really* seems to degrade the "quality" of the web pages
> returned. Not as bad as if one were to limit US-based pages to .us -- where you get a lot of
> boring state and local government web pages -- but still not ideal. E.g., try limiting results
> for Tanzania to .tz or Sri Lanka to .lk -- my impression is that only a small percentage of all
> pages from that country have that TLD, and those pages may not be representative of the whole.

This is familiar.  Eight or so years ago, we were looking for Tagalog pages, and briefly thought 
about using these codes to confine our searches to the Philippines.  Both precision and recall were 
terrible: precision because there were lots of English-language websites in the Philippines (not to 
mention lots of other Philippine languages), and recall for the reason given above.

I wrote some of this up in a paper given at ALLC/ACH in 2004.  It was, however, about finding web 
pages in non-English languages, so the methods probably wouldn't help if you're looking for 
dialectal English.
-- 
	Mike Maxwell
	maxwell at umiacs.umd.edu
	"My definition of an interesting universe is
	one that has the capacity to study itself."
         --Stephen Eastmond



