[Corpora-List] Word Alignment does exist and goes well: A summary

Xu Jiajin ustcxujj at gmail.com
Thu Jun 2 15:14:57 UTC 2011


Two days ago, I asked about Word Alignment, and eight colleagues kindly
responded (Alberto Simões, Afsaneh Fazly, Mark Sammons, Graeme Hirst,
Felipe Sánchez Martínez, Dekai Wu, Michael Barlow, and João Graça).



One of my first observations from the informative responses is that, with
one or two exceptions, most of these colleagues are from Computer Science
departments and work in Computational Linguistics. This might be a perfect
excuse for my not being aware of the enormous amount of work done on Word
Alignment, as I am a linguist with a theoretical flavour. :) Most linguists in
contrastive linguistics and translation studies see sentence alignment as
the only reliable and viable correspondence of linguistic units. However,
when we look beyond the scope of pure language studies, alignment work
covers far more than sentence alignment, as this discussion has made clear.



I’d summarize the discussions as follows:

Word Alignments are used in a variety of applications.

1. All Statistical Machine Translation systems start from word
alignments to extract translation units.

2. Jointly training models in different languages and coupling them for
better learning.

3. Passing annotations from one language to the other.
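
As a rough illustration of the third application (this sketch is not from
any of the respondents), annotations such as part-of-speech tags can be
copied across alignment links. The function name `project_tags`, the
"UNK" default label, and the toy sentence pair are all hypothetical; real
projection systems usually add filtering and smoothing on top of this.

```python
from collections import defaultdict


def project_tags(src_tags, links, tgt_len, default="UNK"):
    """Copy each source token's tag to the target tokens it is linked to.

    src_tags: list of tags, one per source token
    links:    iterable of (src_index, tgt_index) pairs, 0-based
    tgt_len:  number of target tokens
    """
    candidates = defaultdict(list)
    for s, t in sorted(links):
        candidates[t].append(src_tags[s])
    # Keep the tag of the first linked source token; target tokens with
    # no link receive the default label.
    return [candidates[t][0] if t in candidates else default
            for t in range(tgt_len)]


# Toy example (sentence pair and links are made up for illustration).
src_tags = ["DET", "NOUN", "VERB", "ADJ"]      # the house is small
tgt = ["das", "Haus", "ist", "klein"]
links = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(list(zip(tgt, project_tags(src_tags, links, len(tgt)))))
```

Taking the first linked tag is only one possible policy; a majority vote
over all linked source tokens would be an equally reasonable assumption.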

There are several good implementations of word alignment: PostCAT, the
Berkeley Aligner, GIZA++ (Franz Och), and mkcls (Franz Och), just to name a few.



Word alignments do not have to be one-to-one; they can be many-to-many, and
hence we can have phrase alignments.

--the above adapted from João Graça
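
To make the step from many-to-many word links to phrase alignments more
concrete, here is a minimal sketch of the usual "consistent phrase pair"
idea from phrase-based SMT. It is a simplified illustration under stated
assumptions, not the code of any toolkit named above: the function name
`extract_phrase_pairs`, the `max_len` limit, and the toy links are
hypothetical, and the sketch does not extend phrases over unaligned
target words as full implementations typically do.

```python
def extract_phrase_pairs(src, tgt, links, max_len=3):
    """Return phrase pairs that are consistent with the word links.

    A source span and a target span are consistent if they share at
    least one link and no link connects a word inside one span to a
    word outside the other.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_pos = [t for s, t in links if i1 <= s <= i2]
            if not tgt_pos:
                continue  # no links, so nothing to pair this span with
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject if a target word in [j1, j2] links outside [i1, i2].
            if any(j1 <= t <= j2 and not (i1 <= s <= i2) for s, t in links):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]),
                          " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example: "not" links to both "ne" and "pas"; "do" is unaligned.
src = ["I", "do", "not", "know"]
tgt = ["je", "ne", "sais", "pas"]
links = {(0, 0), (2, 1), (2, 3), (3, 2)}
for pair in extract_phrase_pairs(src, tgt, links):
    print(pair)
```

In this toy case "not" cannot be extracted on its own, because its two
target words surround "sais"; it only appears inside larger units such as
"not know" / "ne sais pas".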



Word alignment is, to some extent, term alignment. It may also involve
aligning terms that contain blanks or placeholders, and aligning words to
empty words (just as we can align sentences to empty sentences).

Aligning 100% of the words might be difficult or even impossible (for
compound verbs, for instance).

Calling it ‘a joke’ (the inappropriate wording in my original post) can be
offensive to people working on word alignment.

--the above adapted from Alberto Simões



Related implementations and literature

The Mathematics of Statistical Machine Translation: Parameter Estimation.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and
Robert L. Mercer. Computational Linguistics, 1993.



The alignment template approach to statistical machine translation.

Franz Josef Och and Hermann Ney. Computational Linguistics, 30:417–449.
2004.



Jörg Tiedemann's book "Bitext Alignment", which is about to be published
(probably this week!) by Morgan & Claypool (morganclaypool.com) in their HLT
Synthesis series.  It includes a 45-page chapter on word alignment.
(provided by Graeme Hirst)



Word alignment implementations have been around for a while: GIZA++ (
http://code.google.com/p/giza-pp/) is the most used, but there are other
word aligners such as BerkeleyAligner (
http://code.google.com/p/berkeleyaligner).



GIZA++ implements the alignment models described in

Och, Franz Josef, and Hermann Ney (2003) "A Systematic Comparison of Various
Statistical Alignment Models." Computational Linguistics 29(1): 19-51.
http://acl.ldc.upenn.edu/J/J03/J03-1002.pdf
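
For readers new to these models, the simplest of them, IBM Model 1 from
the Brown et al. (1993) paper listed above, can be sketched in a few
lines. This is a toy illustration and not GIZA++ code or its interface:
the function name `ibm_model1`, the `<NULL>` token string, the iteration
count, and the three-sentence corpus are all assumptions made for the
example. The extra `<NULL>` token on the English side is what allows a
word to be aligned to an "empty word".

```python
from collections import defaultdict


def ibm_model1(bitext, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    bitext: list of (e_tokens, f_tokens) sentence pairs
    Returns a dict mapping (f, e) to t(f | e).
    """
    NULL = "<NULL>"  # hypothetical name for the empty word
    f_vocab = {f for _, fs in bitext for f in fs}
    # Uniform initialisation of t(f | e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e) links
        total = defaultdict(float)  # expected counts of e
        for es, fs in bitext:
            es = [NULL] + es
            for f in fs:
                # E-step: share one count for f among all candidate e.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per e.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)


# Toy corpus (made up for illustration).
toy_bitext = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
table = ibm_model1(toy_bitext)
for f in ["das", "Haus", "Buch"]:
    print(f, "given 'the':", round(table[(f, "the")], 3))
```

The returned table only contains word pairs that co-occur in some
sentence pair; a most-likely alignment can then be read off by linking
each f to the e in its sentence with the highest t(f | e).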



Dekai Wu’s "Alignment" chapter in the Handbook of Natural Language
Processing.  The chapter has been extensively revised for the new second
edition (2010), edited by N. Indurkhya & F.J. Damerau, Chapman and Hall /
CRC Press, pp.367-408. (It covers token vs segmental alignments, at word,
phrase/collocation, and sentence levels. Starting from flat models, it
progressively moves to compositional/hierarchical models that can handle the
sorts of constructions and idioms you are thinking about, using biparsing
with transduction grammars.)



Thanks go to all the participants in the discussion, which was enlightening
and informative indeed.





Best wishes,


Jiajin XU

Beijing Foreign Studies University