[Corpora-List] Man bites dog

Noah A Smith nasmith at cs.cmu.edu
Mon Nov 21 12:50:47 UTC 2011


The phrase "pure statistical MT" is problematic.  Statistics are something
you calculate from data; statistical MT systems use those statistics to
decide how to translate.  The range of ways that can happen is wide, but
each and every possibility relies on some assumptions about how the symbols
get mapped and arranged, just like in a purely symbolic MT system.  Where
you draw the line between "pure" and "hybrid" is an arbitrary choice.

Even the "bag of words" model (I take this to mean something like IBM Model
1) makes such assumptions (most obviously, that words translate into other
words).  There are certainly SMT model/dataset combinations that could
translate "homme mord chien" correctly without having seen the exact string
before, depending on the relative weight given to preserving the input word
order vs. the language model.
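
To make that trade-off concrete, here is a minimal sketch of a toy decoder.
Everything in it is an assumption for illustration: the three-word lexicon,
the bigram probabilities (deliberately skewed toward "dog bites man", as in
Kay's scenario), the jump-distance distortion cost, and the names translate,
lm_score, and distortion_cost.  It is not how any particular SMT system is
implemented; it only shows that the winner depends on how the two scores are
weighted.

```python
import itertools
import math

# Hypothetical word-for-word lexicon (words translate into other words).
LEXICON = {"homme": "man", "mord": "bites", "chien": "dog"}

# Made-up bigram probabilities for a toy English language model,
# deliberately biased toward "dog bites man".
BIGRAM_LOGPROB = {
    ("<s>", "dog"): math.log(0.6),
    ("<s>", "man"): math.log(0.4),
    ("dog", "bites"): math.log(0.5),
    ("man", "bites"): math.log(0.1),
    ("bites", "man"): math.log(0.5),
    ("bites", "dog"): math.log(0.1),
}
UNSEEN = math.log(1e-4)  # crude floor for bigrams not listed above


def lm_score(words):
    """Sum of bigram log-probabilities, starting from the <s> marker."""
    total, prev = 0.0, "<s>"
    for word in words:
        total += BIGRAM_LOGPROB.get((prev, word), UNSEEN)
        prev = word
    return total


def distortion_cost(order):
    """Total jump distance between consecutively translated source positions.

    A monotone (left-to-right) order costs 0.
    """
    cost, prev = 0, -1
    for pos in order:
        cost += abs(pos - prev - 1)
        prev = pos
    return cost


def translate(source, w_lm, w_dist):
    """Return the best-scoring ordering of the word-for-word translations."""
    best_score, best_words = None, None
    for order in itertools.permutations(range(len(source))):
        words = [LEXICON[source[i]] for i in order]
        score = w_lm * lm_score(words) - w_dist * distortion_cost(order)
        if best_score is None or score > best_score:
            best_score, best_words = score, words
    return " ".join(best_words)


source = ["homme", "mord", "chien"]
print(translate(source, w_lm=1.0, w_dist=0.0))  # language model alone decides
print(translate(source, w_lm=1.0, w_dist=1.0))  # input order also counts
```

With these made-up numbers, the language model alone prefers the reversed
order, while any distortion weight above roughly 0.6 keeps the source order.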

For the record, Google's translation system gets this one right:
http://translate.google.com/#fr|en|homme%20mord%20chien

Noah
--
Noah Smith
Associate Professor
School of Computer Science
Carnegie Mellon University


On Mon, Nov 21, 2011 at 7:01 AM, Jimmy O'Regan <joregan at gmail.com> wrote:

> On 21 November 2011 03:15, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
> > In LILT 6 (http://elanguage.net/journals/index.php/lilt/issue/current),
> > "Zipf's Law and l'Arbitraire du Signe," Martin Kay discusses statistical
> > MT, and says (p.22):
> >
> >   Notice that a language model would, and should, guarantee
> >   that the French “homme mord chien” would be translated into
> >   English as “dog bites man”, rather than “man bites dog”,
> >   which is what it really means.
> >
> > I once proposed this exact example (with Spanish rather than French) to a
> > computational linguist who knew more about MT than I do.  (People who
> > know more about MT than I do are quite common.  Ok, they're quite common
> > among computational linguists :-).)  That person suggested I needed to
> > learn more about MT.
> >
> > It would be nice to find myself making the same mistake that Martin Kay
> > made.  It would be even nicer if it weren't a mistake.
> >
> > Is Kay's claim correct?  The context is of course pure statistical MT,
> > not hybrid rule/statistical systems.  Assume that the pair "homme mord
> > chien"/"man bites dog" never occurs in the training data, but that the
> > reverse does (or at least that "dog bites man" appears on the English
> > side, presumably with some significant frequency).
>
> That idea overlooks how statistical reordering works, and assumes a
> 'bag of words' based method; it also presumes that the bigrams 'man
> bites' and 'bites dog' never occur. More importantly, it assumes that
> 'dog bites man' is a more frequent trigram in English (i.e., the
> target language model), which doesn't seem to be true
> (http://books.google.com/ngrams/graph?content=man+bites+dog%2C+dog+bites+man&year_start=1800&year_end=2000&corpus=0&smoothing=3):
> which makes sense in hindsight, when you consider the idiomatic value
> of 'man bites dog'.
>
> It has a sort of metaphorical truth, regarding SMT's difficulties with
> novelty, but it's not literally true - file it away with 'the meat is
> rotten, but the vodka is good' :).
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>