[Corpora-List] Google's translations

Yorick Wilks Y.Wilks at dcs.shef.ac.uk
Sun Mar 14 22:41:41 UTC 2010


The reason pivoting through English works better may have nothing to do with the route by which the documents were originally produced. It may simply be that, because the English language model is so much larger than any other, pivoting through it removes more ambiguities at the first (X-to-EN) stage, which makes the second stage easier and more accurate.
Yorick Wilks


On 14 Mar 2010, at 22:27, Chris Dyer wrote:

>>> 3. Another interesting experiment is to let Google translate the German
>>> word "Ufer" (meaning "bank", but only in the waterside sense) into Czech.
>>> This gives "banky", which means "bank", but only in its financial sense.
>>> This can be explained by the observation that Google always uses English as
>>> interlingua (Ufer --> bank --> banky). If you directly translate e.g.
>>> Spanish to French you will get exactly the same result as when you first
>>> translate Spanish into English, and then translate the English output into
>>> French.
>>> Obviously, even for Google it is too costly to generate and maintain 52 *
>>> 51 = 2652 translation models for all the supported language pairs. Or is it
>>> that they have found that X to English to Y always performs better than X to
>>> Y because there is so much more data available between English and X or Y
>>> than between X and Y?
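
The arithmetic and the pivoting effect in the quoted point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the two one-word lexicons are hypothetical, and nothing here describes how Google's system is actually built. It counts the directed systems needed for 52 languages with and without a single pivot, and shows how a sense distinction can be lost when the pivot word is ambiguous.

```python
def direct_systems(n):
    """Directed systems needed to cover every ordered pair of n languages."""
    return n * (n - 1)


def pivot_systems(n):
    """Directed systems needed if every pair is routed through one pivot
    language that is itself one of the n languages (X-to-pivot, pivot-to-Y)."""
    return 2 * (n - 1)


# Hypothetical one-word lexicons, for illustration only.
DE_EN = {"Ufer": "bank"}    # waterside sense lands on ambiguous English "bank"
EN_CS = {"bank": "banky"}   # English "bank" is rendered in the financial sense


def pivot_translate(word, src_to_pivot, pivot_to_tgt):
    """Translate via the pivot; return (pivot word, target word)."""
    pivot = src_to_pivot[word]
    return pivot, pivot_to_tgt[pivot]


print(direct_systems(52))                      # 2652
print(pivot_systems(52))                       # 102
print(pivot_translate("Ufer", DE_EN, EN_CS))   # ('bank', 'banky')
```

The second stage only ever sees the pivot word, so whatever sense information the source word carried is unavailable to it unless context in the pivot sentence restores it.
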
>> 
>> That is a fascinating observation. Conventional wisdom has it that going
>> through a pivot language is a poor idea, but that does seem to be what is
>> happening for French-Spanish. Doubly weird because one would hope that the
>> close family relation between French and Spanish would be helpful.
> 
> Some translation results using pivot languages turn out to be quite
> surprising (they were to me, at least).  It turns out that the optimal
> translation path between languages in a statistical system is probably
> a function of characteristics of the training data available to train
> the systems for individual language pairs.  See, for example, Section
> 6.1 in
> 
> 462 Machine Translation Systems for Europe, Philipp Koehn, Alexandra
> Birch and Ralf Steinberger, MT Summit XII, 2009
> http://www.mt-archive.info/MTS-2009-Koehn-1.pdf
> 
> Their statistical systems do better translating European legalese by
> pivoting through English than using more direct routes, presumably
> because the legalese training data was translated in this way (by
> humans). In other words, while there is presumably some good "direct"
> translation between closely related languages, it's not always
> learnable by statistical systems from the available training data.
> So, going through English may be a good idea, not just because it
> means you have to build fewer systems.
> 
> -Chris
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

