[Corpora-List] Google's translations

Janne Bondi Johannessen jannebj at iln.uio.no
Mon Mar 15 00:18:26 UTC 2010



Clearly the massive amount of English training data has some extremely
unfortunate consequences for the choice of lexical items. It is amusing, but
potentially very confusing, when one currency is translated into another
(but the amount is not recalculated!). In the following test, I have used
three ways of writing the Norwegian currency (kroner). The first one is
translated to US dollars!

Olje koster kr 800 fatet. Gassen koster NOK 400 mens vindkraft koster seksti
kroner.

Oil costs U.S. $ 800 per barrel. The gas costs NOK 400, while wind power
costs sixty kroner. (Google translate)

 Janne Bondi Johannessen (who is still a Google Translate fan)

2010/3/15 Jimmy O'Regan <joregan at gmail.com>

> On 11 March 2010 13:18, Peter Kolb <pekoli at gmail.com> wrote:
> > 3. Another interesting experiment is to let Google translate the German
> > word "Ufer" (meaning "bank", but only in the waterside sense) into Czech.
> > This gives "banky", which means "bank", but only in its financial sense.
> > This can be explained by the observation that Google always uses English
> > as interlingua (Ufer --> bank --> banky). If you directly translate e.g.
> > Spanish to French you will get exactly the same result as when you first
> > translate Spanish into English, and then translate the English output
> > into French.
> > Obviously, even for Google it is too costly to generate and maintain
> > 52 * 51 = 2652 translation models for all the supported language pairs.
> > Or is it that they have found that X to English to Y always performs
> > better than X to Y because there is so much more data available between
> > English and X or Y than between X and Y?
>
> Improving Word Alignment with Bridge Languages, Shankar Kumar, Franz
> Och, Wolfgang Macherey, Conference on Empirical Methods in Natural
> Language Processing and Computational Natural Language Learning, 2007.
> http://www.aclweb.org/anthology-new/D/D07/D07-1005.pdf
>
> '   We show that parallel corpora in multiple languages can be exploited
> to improve the translation performance of a phrase-based translation
> system. This paper gives specific recipes for using a bridge language to
> construct a word alignment and for combining word alignments produced by
> multiple statistical alignment models.'
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
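
The pivoting effect described in the quoted message (Ufer --> bank --> banky)
can be sketched with toy dictionaries. This is only an illustration under
stated assumptions: the two lexicons below are invented for the example, the
function names are hypothetical, and nothing here reflects how Google's
system is actually built. The last two helpers just redo the arithmetic on
language pairs from the quoted message.

```python
# Toy lexicons, invented for illustration only; not real translation tables.
# German "Ufer" is the waterside sense; both it and the financial "Bank"
# collapse onto the same English string.
DE_TO_EN = {"Ufer": "bank", "Bank": "bank"}
# The pivot step sees only the string "bank" and picks a single target form.
EN_TO_CS = {"bank": "banky"}


def pivot_translate(word, src_to_pivot, pivot_to_tgt):
    """Translate through a pivot language.

    Returns (pivot_word, target_word). The source sense is not carried
    across the pivot, so an ambiguous pivot word can flip the meaning.
    """
    pivot = src_to_pivot[word]
    return pivot, pivot_to_tgt[pivot]


def directed_pairs(n_languages):
    """Number of direct models needed: one per ordered language pair."""
    return n_languages * (n_languages - 1)


def pivot_directions(n_languages):
    """Number of models needed if everything goes through one pivot:
    into the pivot and out of it, for each of the other languages."""
    return 2 * (n_languages - 1)


if __name__ == "__main__":
    print(pivot_translate("Ufer", DE_TO_EN, EN_TO_CS))
    print(directed_pairs(52), pivot_directions(52))
```

With 52 languages, direct translation needs 52 * 51 ordered pairs, while
pivoting through English needs only 2 * 51 directions, which is one plausible
reading of why a pivot would be attractive despite the sense confusion.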



-- 
Janne Bondi Johannessen
Professor, The Text Laboratory, ILN, http://www.hf.uio.no/tekstlab/
President, NEALT, http://omilia.uio.no/nealt/
University of Oslo
P.O.Box 1102 Blindern, N-0317 Oslo, Norway
Tel: +47 22 85 68 14, mob.: +47 928 966 34