[Corpora-List] re: pronunciation (caveat)
Gregor Erbach
gor at acm.org
Tue Jul 30 13:48:19 UTC 2002
Quoting Damon Allen Davison <linguist at socal.rr.com>:
> A caveat to all about relying too much on Google (and other search
> engines) for corpus research:
>
> Although Google allows you to define the page language for searches, it
> looks at ISO tags in the HTML source to determine this.
Not exclusively. Google also uses the document content for language
identification. Basis Technology (http://www.basistech.com/) claim
that Google is a user of their language identification software.
On the Web, the language can be specified in the HTML "lang" attribute
and in the HTTP 1.1 "Content-Language" response header.
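
As a rough illustration of those two declaration mechanisms, here is a
minimal Python sketch that reads the "lang" attribute from the <html>
element and the language tags from a "Content-Language" header. The names
LangAttrParser and declared_languages are made up for this example, and
the headers are assumed to arrive as a plain dict. It only reports what a
page declares; it does not do content-based identification, and it is not
meant to describe how Google or Basis Technology work.

```python
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_languages(html_text, headers):
    """Return (html_lang, header_langs) as declared by the document.

    html_text: page source as a string.
    headers:   dict of HTTP response headers (hypothetical input shape).
    """
    parser = LangAttrParser()
    parser.feed(html_text)

    # Header names are case-insensitive, so compare in lowercase.
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "content-language":
            header_value = value

    # Content-Language may list several comma-separated language tags.
    header_langs = [t.strip() for t in header_value.split(",") if t.strip()]
    return parser.lang, header_langs


page = '<html lang="de"><body><p>Guten Tag</p></body></html>'
print(declared_languages(page, {"Content-Language": "de, en"}))
```

Comparing these declared values with the output of a content-based
identifier would be one way to estimate how often default tags are wrong.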
> Many people who
> have their own web sites use software that by default inserts an
> English-language ISO tag into their source. Therefore, any spelling
> that happens to be a word in another language may indeed be written in
> another language, despite what the search engine claims.
I haven't found this to cause significant problems for
the Google language identifier.
Regards,
Gregor Erbach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Gregor Erbach http://purl.org/net/gregor/
Saarland University http://www.uni-sb.de/
Computational Linguistics Dept. http://www.coli.uni-sb.de/
Project COLLATE http://collate.dfki.de/
Tel. +49 (681) 302-5354 mailto:gor at acm.org