[Corpora-List] language-specific harvesting of texts from the Web

Mark P. Line mark at polymathix.com
Mon Aug 30 20:51:02 UTC 2004


I've been playing with Google searches for extracting texts in a
particular language from the Web without a lot of noise (i.e. with few
texts that aren't in the desired language). Any comments on the utility of this
approach for more serious corpus research? Any improvements to the best
search criteria I've been able to come up with below? Any good search
criteria for languages not listed?

(If there's any interest at all, I'd be happy to collect searches like
these on a webpage somewhere.)


Examples:

Basque:
http://www.google.com/search?q=gandik+gana&ie=utf-8&oe=utf-8

Bislama/Pijin:
http://www.google.com/search?q=blong+stap&ie=utf-8&oe=utf-8

Catalan:
http://www.google.com/search?q=els+uns+unes&ie=utf-8&oe=utf-8

Indonesian:
http://www.google.com/search?q=tidak+yang+karena&ie=utf-8&oe=utf-8

Letzebuergesch:
http://www.google.com/search?q=fir+eng+dat&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8

Malay:
http://www.google.com/search?q=tidak+yang+kerana&ie=utf-8&oe=utf-8

Malay/Indonesian:
http://www.google.com/search?q=tidak+yang&ie=utf-8&oe=utf-8

Mongolian:
http://www.google.com/search?q=%D0%B1%D0%B0%D0%B9%D0%BD%D0%B0+&ie=utf-8&oe=utf-8

Nahuatl:
http://www.google.com/search?q=auh+inic&ie=utf-8&oe=utf-8

North Frisian:
http://www.google.com/search?q=%C3%BC%C3%BCb+m%C3%A4+uun&ie=utf-8&oe=utf-8

Saami:
http://www.google.com/search?q=atte+son+ja+dat&ie=utf-8&oe=utf-8

Shona:
http://www.google.com/search?q=kusvika&ie=utf-8&oe=utf-8

Sorbian:
http://www.google.com/search?q=%C5%A1to%C5%BE&ie=utf-8&oe=utf-8

Swahili:
http://www.google.com/search?q=ya+ni+katika&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8

Tagalog:
http://www.google.com/search?q=%22ang+mga%22&ie=utf-8&oe=utf-8

Tok Pisin:
http://www.google.com/search?q=long+bilong&ie=utf-8&oe=utf-8

Welsh:
http://www.google.com/search?q=cymraeg+mae&ie=utf-8&oe=utf-8
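
The searches above all share one shape: a handful of words typical of
the language, joined into the q parameter, with ie and oe set to utf-8.
For anyone who wants to generate such URLs rather than type them, here is
a minimal sketch in Python. The names SEED_WORDS and build_search_url are
made up for illustration, and the word lists are simply copied from the
examples above; nothing here fetches or parses results.

```python
from urllib.parse import urlencode

# Seed words copied from the example searches in this post.
# The dictionary and function names are hypothetical, for illustration only.
SEED_WORDS = {
    "Basque": ["gandik", "gana"],
    "Indonesian": ["tidak", "yang", "karena"],
    "Malay": ["tidak", "yang", "kerana"],
    "North Frisian": ["üüb", "mä", "uun"],
    "Sorbian": ["štož"],
}


def build_search_url(words, phrase=False):
    """Build a Google search URL whose query is the given seed words.

    With phrase=True the words are wrapped in double quotes, as in the
    Tagalog example ("ang mga").
    """
    query = " ".join(words)
    if phrase:
        query = '"' + query + '"'
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns
    # spaces into '+', which matches the URLs listed above.
    params = {"q": query, "ie": "utf-8", "oe": "utf-8"}
    return "http://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    for language, words in SEED_WORDS.items():
        print(language + ":")
        print(build_search_url(words))
        print()
```

Keeping the word lists in one table makes it easy to see which word
separates close neighbours, e.g. karena for Indonesian versus kerana for
Malay.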


-- Mark

Mark P. Line
Polymathix
San Antonio, TX


