[Corpora-List] Word Alignment does exist and goes well: A summary

Joerg Tiedemann jorg.tiedemann at lingfil.uu.se
Sat Jun 4 17:51:29 UTC 2011


Just to make your list of applications a bit more complete:

* word alignment can also be useful for lexicon extraction and rule
induction in non-statistical machine translation
* it can be used for chunk alignment in example-based MT
* word alignment is used for the extraction of domain-specific terminology
translations (mostly for computer-aided translation)
* word alignment has been used for word sense disambiguation/discrimination
(using translations as "semantic mirrors" with different lexical
ambiguities)
* it can be used to extract (WordNet-like) lexico-semantic relations (in one
language and/or across languages)
* word alignment can be applied to find (monolingual) term variants, which
have been used for query expansion in IR and QA
* the extraction of paraphrases is another application where word alignment
has been used
* interestingly enough, the limitations of automatic word alignment can also
be used to identify non-compositional (idiomatic) expressions
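
To make the first item (lexicon extraction) a bit more concrete, here is a
minimal Python sketch. It assumes the bitext is already word-aligned and that
each sentence pair comes with a set of (source index, target index) links.
The function name, the thresholds and the toy data are made up for
illustration; real systems use more careful association scores and filtering.

```python
from collections import Counter, defaultdict


def extract_lexicon(bitext, min_count=2, min_prob=0.5):
    """Collect a translation lexicon from word-aligned sentence pairs.

    bitext: iterable of (src_tokens, tgt_tokens, links), where links is a
    set of (src_index, tgt_index) pairs. Thresholds are illustrative only.
    """
    pair_counts = Counter()   # how often a source word is linked to a target word
    link_counts = Counter()   # how many links leave each source word
    for src, tgt, links in bitext:
        for i, j in links:
            pair_counts[(src[i], tgt[j])] += 1
            link_counts[src[i]] += 1

    lexicon = defaultdict(dict)
    for (s, t), count in pair_counts.items():
        prob = count / link_counts[s]   # relative frequency of this link
        if count >= min_count and prob >= min_prob:
            lexicon[s][t] = prob
    return dict(lexicon)


# Hypothetical toy data, aligned by hand.
toy_bitext = [
    (["the", "house"], ["das", "Haus"], {(0, 0), (1, 1)}),
    (["the", "house", "is", "small"], ["das", "Haus", "ist", "klein"],
     {(0, 0), (1, 1), (2, 2), (3, 3)}),
    (["the", "book"], ["das", "Buch"], {(0, 0), (1, 1)}),
]

lexicon = extract_lexicon(toy_bitext)
```

With these thresholds only the pairs seen at least twice survive, which is
the usual way of coping with noisy links: rare links are simply dropped.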

I could give you a lot of pointers if you like.

Clearly, there is some confusion about the term "alignment": in word
alignment it does not denote a monotonic, complete, one-to-one mapping.
Automatic word alignment is certainly very noisy, so the alignment itself is
usually not saved but only used to support another task (like translation
modeling in SMT). Looking at automatic word alignment results can sometimes
(often?) feel like a joke, but they can still be useful for many tasks, as
you have seen in the responses to your query.
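
As a small illustration of what "not monotonic, complete, one-to-one" means
in practice, a word alignment can be represented as a plain set of links
between token positions. The sketch below is only an assumption about one
convenient representation; the helper names and the hand-aligned example
sentence pair are invented for this message.

```python
def is_one_to_one(links):
    """True if no source or target position takes part in more than one link."""
    src = [i for i, _ in links]
    tgt = [j for _, j in links]
    return len(set(src)) == len(src) and len(set(tgt)) == len(tgt)


def is_monotonic(links):
    """True if target positions never decrease when links are sorted by source."""
    ordered = sorted(links)
    return all(a[1] <= b[1] for a, b in zip(ordered, ordered[1:]))


def unaligned(n_src, n_tgt, links):
    """Return the source and target positions that have no link at all."""
    src_aligned = {i for i, _ in links}
    tgt_aligned = {j for _, j in links}
    return ([i for i in range(n_src) if i not in src_aligned],
            [j for j in range(n_tgt) if j not in tgt_aligned])


# Hypothetical hand-made example (English / German).
src = ["I", "did", "not", "see", "him"]
tgt = ["ich", "sah", "ihn", "ja", "nicht"]
links = {(0, 0), (1, 1), (3, 1), (4, 2), (2, 4)}

one_to_one = is_one_to_one(links)              # "did" and "see" both link to "sah"
monotonic = is_monotonic(links)                # "not" -> "nicht" crosses other links
free = unaligned(len(src), len(tgt), links)    # "ja" has no counterpart
```

Here the alignment is many-to-one, crossing, and leaves one target word
unaligned, which is perfectly normal for word alignment.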

Good luck with further discussions with your colleagues!

Jörg


On Thu, Jun 2, 2011 at 5:14 PM, Xu Jiajin <ustcxujj at gmail.com> wrote:

> Two days ago, I asked about Word Alignment and received kind responses from
> eight colleagues (Alberto Simões, Afsaneh Fazly, Mark Sammons, Graeme
> Hirst, Felipe Sánchez Martínez, Dekai Wu, Michael Barlow, and João Graça).
>
>
>
>
> One of my first observations from the informative responses is that, with
> one or two exceptions, most of the colleagues are from Computer Science
> departments and work in Computational Linguistics. This might be a perfect
> excuse for my not being aware of the enormous amount of work done on Word
> Alignment, as I am a linguist with a theoretical flavour. :) :) Most
> linguists in contrastive linguistics and translation studies see sentence
> alignment as the only reliable and viable correspondence between linguistic
> units. However, when we look beyond the scope of pure language studies,
> alignment work covers far more than sentence alignment, as the discussion
> has made especially clear.
>
>
>
> I’d summarize the discussions as follows:
>
> Word Alignments are used in a variety of applications.
>
> 1. All Statistical Machine Translation systems, which start from word
> alignments to extract translation units.
>
> 2. Jointly training models in different languages and coupling them for
> better learning.
>
> 3. Passing annotations from one language to the other.
>
> There are several good implementations of word alignment: PostCAT, Berkeley
> Aligner, GIZA++ (Franz Och), mkcls (Franz Och), just to name a few.
>
>
>
> Word alignments do not have to be one-to-one; they can be many-to-many, and
> hence we can have phrase alignments.
>
> --the above adapted from João Graça
>
>
>
> Word alignment is, to some extent, term alignment; it may also involve
> aligning terms that contain blanks or placeholders, and aligning words to
> empty words (just as sentences can be aligned to empty sentences).
>
> Aligning 100% of the words might be difficult or even impossible (for
> compound verbs, for instance).
>
> Calling it ‘a joke’ (the inappropriate wording in my original post) can be
> offensive to people working on word alignment.
>
> --the above adapted from Alberto Simões
>
>
>
> Related implementations and literature
>
> The Mathematics of Statistical Machine Translation: Parameter Estimation.
>
> Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and
> Robert L. Mercer. Computational Linguistics, 19(2):263-311. 1993.
>
>
>
> The alignment template approach to statistical machine translation.
>
> Franz Josef Och and Hermann Ney. Computational Linguistics, 30:417–449.
> 2004.
>
>
>
> Jörg Tiedemann's book "Bitext Alignment", which is about to be published
> (probably this week!) by Morgan & Claypool (morganclaypool.com) in their
> HLT Synthesis series.  It includes a 45-page chapter on word alignment.
> (provided by Graeme Hirst)
>
>
>
> Word alignment implementations have been around for a while: GIZA++ (
> http://code.google.com/p/giza-pp/) is the most used, but there are other
> word aligners such as BerkeleyAligner (
> http://code.google.com/p/berkeleyaligner).
>
>
>
> GIZA++ implements the alignment models described in
>
> Och, Franz Josef, and Hermann Ney (2003) "A Systematic Comparison of
> Various Statistical Alignment Models." Computational Linguistics 29(1):
> 19-51. http://acl.ldc.upenn.edu/J/J03/J03-1002.pdf
>
>
>
> Dekai Wu’s "Alignment" chapter in the Handbook of Natural Language
> Processing.  The chapter has been extensively revised for the new second
> edition (2010), edited by N. Indurkhya & F.J. Damerau, Chapman and Hall /
> CRC Press, pp.367-408. (It covers token vs segmental alignments, at word,
> phrase/collocation, and sentence levels. Starting from flat models, it
> progressively moves to compositional/hierarchical models that can handle the
> sorts of constructions and idioms you are thinking about, using biparsing
> with transduction grammars.)
>
>
>
> Thanks go to all the participants in the discussion, which was enlightening
> and informative indeed.
>
>
>
>
>
> Best wishes,
>
>
> Jiajin XU
>
> Beijing Foreign Studies University
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
**********************************************************************************
 Jörg Tiedemann                      jorg.tiedemann at lingfil.uu.se
 Dep. of Linguistics and Philology   http://stp.lingfil.uu.se/~joerg/
 Uppsala University                  tel: +46 (0)18 - 471 1412
 Box 635, SE-751 26 Uppsala/SWEDEN   fax: +46 (0)18 - 471 1094