[Corpora-List] Google "region"-based searches
Mike Maxwell
maxwell at umiacs.umd.edu
Wed Nov 28 22:58:51 UTC 2012
On 11/28/2012 8:40 AM, Mark Davies wrote:
> The other option, of course, is to use TLD (.lk for Sri Lanka, .sg for Singapore, .tz for
> Tanzania, etc), but limiting it this way *really* seems to degrade the "quality" of the web pages
> returned. Not as bad as if one were to limit US-based pages to .us -- where you get a lot of
> boring state and local government web pages -- but still not ideal. E.g., try limiting results
> for Tanzania to .tz or Sri Lanka to .lk -- my impression is that only a small percentage of all
> pages from that country have that TLD, and those pages may not be representative of the whole.
This is familiar. Eight or so years ago, we were looking for Tagalog pages, and briefly thought
about using the country-code TLD (.ph) to confine our searches to the Philippines. Both precision and recall were
terrible: precision because there were lots of English-language websites in the Philippines (not to
mention lots of other Philippine languages), and recall for the reason given above.
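
To make the precision/recall point concrete, here is a minimal sketch of how one might score a TLD filter against a small hand-labelled sample. The URLs, the labels, and the function names (tld_filter, precision_recall) are all made up for illustration; this is not the procedure we actually used.

```python
from urllib.parse import urlparse


def tld_filter(url, tld):
    """Return True if the URL's host name ends with the given TLD (e.g. "ph")."""
    host = urlparse(url).hostname or ""
    return host.endswith("." + tld)


def precision_recall(pages, tld):
    """Score a TLD filter against hand-labelled pages.

    pages: list of (url, is_target) pairs, where is_target is True when a
    human judged the page to be in the target language (hypothetical labels).
    """
    # Labels of the pages the filter lets through.
    selected = [is_target for url, is_target in pages if tld_filter(url, tld)]
    true_pos = sum(selected)
    total_targets = sum(is_target for _, is_target in pages)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / total_targets if total_targets else 0.0
    return precision, recall


# Invented sample: three .ph pages (only one in Tagalog) and three pages
# hosted elsewhere (two of them in Tagalog).
pages = [
    ("http://news.example.ph/balita", True),
    ("http://shop.example.ph/", False),
    ("http://gov.example.ph/", False),
    ("http://blog.example.com/tagalog", True),
    ("http://forum.example.net/pinoy", True),
    ("http://www.example.org/", False),
]

precision, recall = precision_recall(pages, "ph")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy sample the filter admits non-Tagalog .ph pages (hurting precision) and misses Tagalog pages hosted under .com and .net (hurting recall), which is the pattern described above.
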
I wrote some of this up in a paper given at ALLC/ACH in 2004. It was, however, about finding web
pages in non-English languages; the methods probably wouldn't help if you're looking for dialectal
English.
--
Mike Maxwell
maxwell at umiacs.umd.edu
"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond