[Corpora-List] re: pronunciation (caveat)

Tue Jul 30 13:48:19 UTC 2002

Quoting Damon Allen Davison <linguist at socal.rr.com>:
> A caveat to all about relying too much on Google (and other search
> engines) for corpus research:
>
> Although Google allows you to define the page language for searches, it
> looks at ISO tags in the HTML source to determine this.

Not exclusively. Google also uses the document content for language
identification. Basis Technology (http://www.basistech.com/) claim
that Google is a user of their language identification software.

On the Web, the language can be specified in the HTML "lang" attribute
and in the HTTP 1.1 "Content-Language" response header.
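
As a rough illustration of those two declaration mechanisms, here is a
minimal Python sketch that reads the declared language from a response
header and from the "lang" attribute of the <html> element. The names
LangAttrParser and declared_languages are hypothetical, made up for
this example. It only reports what a page declares. It does not do
content-based identification, and it is not a description of how
Google or any other search engine works.

```python
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_languages(headers, html_text):
    """Return (header_langs, html_lang): what the page declares, not what it contains."""
    header_value = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-language":
            header_value = value
    # Content-Language may carry a comma-separated list of language tags.
    header_langs = [t.strip().lower() for t in header_value.split(",") if t.strip()]

    parser = LangAttrParser()
    parser.feed(html_text)
    return header_langs, parser.lang


if __name__ == "__main__":
    page = '<html lang="en"><body>Guten Tag</body></html>'
    print(declared_languages({"Content-Language": "de, en"}, page))
```

Comparing these declared values with the output of a content-based
identifier is one way to spot pages whose tags disagree with their text.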

> Many people who
> have their own web sites use software that by default inserts an
> English-language ISO tag into their source.  Therefore, any spelling
> that happens to be a word in another language may indeed be written in
> another language, despite what the search engine claims.

I haven't found this to cause significant problems for
the Google language identifier.

regards,

   Gregor Erbach

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Gregor Erbach                  http://purl.org/net/gregor/
Saarland University                http://www.uni-sb.de/
Computational Linguistics Dept.    http://www.coli.uni-sb.de/
Project COLLATE                    http://collate.dfki.de/
Tel. +49 (681) 302-5354            mailto:gor at acm.org