[lg policy] Google, Yahoo! BabelFish use math principles to translate documents online

Harold Schiffman haroldfs at GMAIL.COM
Tue Feb 22 22:06:03 UTC 2011


Google, Yahoo! BabelFish use math principles to translate documents online

By Konstantin Kakaes
Special to The Washington Post
Monday, February 21, 2011; 10:22 AM



Early one morning in 2007, Libby Casey was trying to do her laundry in
a guesthouse in Reykjavik, Iceland. When she couldn't figure out how
to use the washing machine, she opened up the instruction manual. The
guide was written in German, which Casey cannot read, so she typed
bits of it into an Internet translation tool. "It occurs nobody
endlschleudern, however, intercatapults" is one result she got.
Stumped, she pressed some buttons and eventually managed to wash her
clothes, in an elongated wash cycle that kept her pinned down for
three hours.

Casey's quandary will come as no surprise to anyone who has tried to
use a computer to translate things. For decades, machine translation
was mostly useful if you were trying to be funny. But in the last few
years, as anyone using Google Translate, Babel Fish or many other
translation Web sites can tell you, things have changed dramatically.
And all because of an effort begun in the 1980s to remove humans from
the equation.

As the late Frederick Jelinek, who pioneered work on speech
recognition at IBM in the 1970s, is widely quoted as saying: "Every
time I fire a linguist, my translation improves." (He later denied
putting it so harshly.)  Up to that point, researchers working on
machine translation used linguistic models. By getting a computer to
understand how a sentence worked grammatically in one language, the
thought was, it would be possible to create a sentence meaning the
same thing in another language. But the differing rules in different
languages made it difficult.

Jelinek and his group at IBM argued that by using statistics and
probability theory, instead of language rules, a computer could do a
better job of converting one language into another. Translation, they
basically argued, was as much a mathematical problem as a linguistic
one. The computer wouldn't understand the meaning of what it was
translating, but by creating a huge database of words and sentences in
different languages, the computer could be programmed to find the most
common sentence constructions and alignment of words, and how these
were likely to correspond between languages. (Warren Weaver, a
mathematician at the Rockefeller Foundation, had first raised the idea
of a statistical model for translation in a 1947 letter in which he
wrote: "When I look at an article in Russian, I say: 'This is really
written in English, but it has been coded in some strange symbols.' ")
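
A toy sketch of that idea follows. The article does not spell out the
IBM group's algorithm, so this is an assumption: a simple
expectation-maximization loop in the spirit of word-alignment models,
run on a made-up three-sentence French-English corpus. The names
train_word_probs and best_translation are hypothetical.

```python
from collections import defaultdict

def train_word_probs(pairs, iterations=10):
    """Estimate t[(f, e)]: how likely English word e corresponds to French word f.

    pairs is a list of (french_words, english_words) sentence pairs.
    No grammar is used, only co-occurrence counts refined over several passes.
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # start with no preference
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:
            for f in fs:
                # Share one unit of "credit" for f among the English words
                # in the paired sentence, in proportion to current beliefs.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # Re-estimate the probabilities from the shared-out credit.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def best_translation(t, e):
    """Return the French word with the highest estimated probability for e."""
    candidates = {f: p for (f, e2), p in t.items() if e2 == e}
    return max(candidates, key=candidates.get)

# Made-up toy corpus (an assumption, not data from the article).
corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("maison bleue".split(), "blue house".split()),
]
probs = train_word_probs(corpus)
print(best_translation(probs, "house"))
```

Even with three sentence pairs, the counts are enough to separate
"house" from "the": the program never sees a dictionary, only which
words tend to turn up together.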

The IBM effort began with proceedings from the Canadian parliament,
which were published in English and French. "A couple guys drove to
Canada and left with two suitcases full of tapes that contained the
proceedings," says Daniel Marcu, co-founder of Language Weaver, which
in 2002 became the first start-up to use the new statistical techniques.
Jelinek's group began by using a computer to automatically align
sentences in the French and English versions of the parliamentary
documents. It did this by pairing sentences from the same point in the
proceedings that were of roughly equal lengths. If an opening sentence
in English was 20 words long but the French opening was two sentences
of about 10 words, the computer would pair the English sentence with
the two French ones. The IBM researchers then used statistical methods
and deductions to identify sentence structures and groups of words
that were most common in the paired sentences.
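
The length-based pairing described above can be sketched as a small
dynamic program. This is a minimal illustration, not the IBM group's
actual method: the cost used here (difference in word counts plus a
small penalty for merging sentences) and the names align_by_length and
MERGE_PENALTY are assumptions made for the example.

```python
MERGE_PENALTY = 1  # assumed small cost for pairing one sentence with two

def align_by_length(src, tgt):
    """Pair sentences using only their word counts.

    Allows 1-1, 1-2 and 2-1 groupings and returns a list of
    (source_indices, target_indices) tuples in document order.
    """
    n, m = len(src), len(tgt)
    ls = [len(s.split()) for s in src]
    lt = [len(s.split()) for s in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Cheap when both sides have roughly the same number of words.
                diff = abs(sum(ls[i:ni]) - sum(lt[j:nj]))
                c = cost[i][j] + diff + MERGE_PENALTY * (di + dj - 2)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with 1-1, 1-2, 2-1 groupings")
    # Walk back from the end to recover the chosen groupings.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups

# Placeholder sentences: one 20-word English opening against two
# 10-word French sentences, as in the article's example.
english = [" ".join(["word"] * 20), " ".join(["word"] * 5)]
french = [" ".join(["mot"] * 10), " ".join(["mot"] * 10), " ".join(["mot"] * 5)]
print(align_by_length(english, french))  # [([0], [0, 1]), ([1], [2])]
```

The first English sentence is paired with the first two French ones
because their combined length matches, which is the behavior the
article describes.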

As researchers got hold of more documents and translations of them in
different languages, the database of common words and groups of words
grew, providing increasing accuracy and nuance. This is the essence of
the system today.  Although the IBM group's initiative began more than
20 years ago, it has taken time for computer scientists at IBM and
elsewhere to refine those techniques, for computers to become powerful
enough to manage the complexity of the many linguistic probabilities
(such as multiword phrases and idioms) and for databases to grow large
enough - billions of words in various languages - to provide
translations nuanced enough to be usable. This is easier when dealing
with closely related languages, such as French and Spanish, and with
languages that have lots of translated documents with which to build a
database. European languages do well in computer translations in part
because the workings of the European Union must be published in the 23
"official and working languages" of the EU; these documents can then
be used as raw data for researchers.

A major step in computer translation occurred in 2007 - around the
time that Libby Casey was struggling with those Reykjavik washer
instructions - when Google introduced the first free, statistically
based translation software. (Other Web-based translation programs were
still using the older linguistic rule-based systems.)  "Suddenly we
see enormous progress in this technology because of Google's push,"
says Dimitris Sabatakakis, chief executive of Systran, one of the
oldest computer translation companies. (Systran powered Google
Translate until 2007 and is still the engine behind the widely known
Yahoo! Babel Fish computer translation service, which now uses a
hybrid system combining both statistical and linguistic models for
translation.)

All this means that someone such as Michael Cavendish, a lawyer based
in Jacksonville, Fla., can do human-rights work related to China.
"Machine translation has been a godsend for someone like me who has
trouble conversing in foreign languages, because I never got a chance
to study them in depth," he said recently. When Cavendish writes
documents, e-mails or Twitter posts to communicate with dissidents and
others in Chinese, he finds that a computer translation is pretty good
- provided he keeps his English simple. So he doesn't go on about "ex
post facto laws," he said, but simply says: "China arrested this man
today for something that was legal yesterday."

After shunning linguistic systems for many years, the statistical
translation mainstream is now again embracing grammar and other
language-specific rules to capture some nuances and improve accuracy.
Experts say that improvements in translation systems are only going to
continue as the databases they use grow larger and as computer
scientists are better able to incorporate linguistic information.
Soon, researchers say, there will be more and better "speech to
speech" software, which will allow simultaneous translation in
meetings, for instance. The Pentagon is particularly interested in
giving deployed soldiers the ability to communicate with locals: One
project is focusing on translations between English and Pashto, which
is spoken in Afghanistan and Pakistan.

Even as the field rapidly evolves, though, the kinds of odd
translations that Libby Casey encountered doing her laundry in
Reykjavik are unlikely to vanish entirely - as Sandra Alboum recently
found out. Alboum, who runs a translation company in Arlington, was
perusing a manual for a half-million-dollar steel-manipulation machine
that a client of hers had translated, using a computer, from German
into English. "Do not step under floating burdens," it said. She had
to check the manual herself to figure out what was meant: "Do not
stand under suspended loads."

http://www.washingtonpost.com/wp-dyn/content/article/2011/02/21/AR2011022102191.html

-- 
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

 Harold F. Schiffman

Professor Emeritus of
 Dravidian Linguistics and Culture
Dept. of South Asia Studies
University of Pennsylvania
Philadelphia, PA 19104-6305

Phone:  (215) 898-7475
Fax:  (215) 573-2138

Email:  haroldfs at gmail.com
http://ccat.sas.upenn.edu/~haroldfs/
