[Corpora-List] Word Alignment does exist and goes well: A summary

Dekai Wu dekai at cs.ust.hk
Wed Jun 22 17:22:02 UTC 2011


It was pointed out that many people didn't have access to my "Alignment" 
chapter in the Handbook of Natural Language Processing.  Happily, CRC 
Press has (very kindly!) granted me permission to post the full PDF as 
a free sample chapter.

You can download it from 
http://www.cs.ust.hk/~dekai/library/WU_Dekai/Wu_Alignment2010.pdf

As summarized by Xu Jiajin:  "The chapter has been extensively revised 
for the new second edition (2010), edited by N. Indurkhya & F.J. 
Damerau, Chapman and Hall / CRC Press, pp.367-408. (It covers token vs 
segmental alignments, at word, phrase/collocation, and sentence levels. 
Starting from flat models, it progressively moves to 
compositional/hierarchical models that can handle the sorts of 
constructions and idioms you are thinking about, using biparsing with 
transduction grammars.)"

Hope this helps!

-- 
Prof. Dekai Wu   |   dekai at cs.ust.hk   |   http://www.cs.ust.hk/~dekai
HKUST Human Language Technology Center
Department of Computer Science and Engineering
University of Science & Technology, Clear Water Bay, Hong Kong
tel +852 2358.7000   |  dir +852 2358.6989   |  fax +852 2358.1477



Xu Jiajin wrote:
> Two days ago, I asked about Word Alignment, and eight colleagues 
> kindly responded (Alberto Simões, Afsaneh Fazly, Mark Sammons, 
> Graeme Hirst, Felipe Sánchez Martínez, Dekai Wu, Michael Barlow, and 
> João Graça).
>
>  
>
> One of my first observations from the informative responses is that, 
> with one or two exceptions, the colleagues are from Computer Science 
> departments and work in Computational Linguistics. This might be a 
> perfect excuse for my not being aware of the enormous amount of work 
> done on Word Alignment, as I am a linguist with a theoretical 
> flavour. :) :) Most linguists in contrastive linguistics and 
> translation studies see sentence alignment as the only reliable and 
> viable correspondence of linguistic units. However, when we look 
> beyond the scope of pure language studies, as this discussion has 
> shown, alignment work covers far more than sentence alignment.
>
>  
>
> I’d summarize the discussions as follows:
>
> Word Alignments are used in a variety of applications.
>
> 1. All Statistical Machine Translation systems, which start from word 
> alignments to extract translation units.
>
> 2. Jointly training models in different languages and coupling them 
> for better learning.
>
> 3. Passing annotations from one language to the other.
>
> There are several good implementations of word alignment: PostCAT, the 
> Berkeley Aligner, GIZA++ (Franz Och), and mkcls (Franz Och), just to 
> name a few.
>
>  
>
> Word alignments do not have to be one to one; they can be many to many, 
> and hence we can have phrase alignments.
>
> --the above adapted from João Graça
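
To make the step from many-to-many word links to phrase alignments 
concrete, here is a minimal sketch. It is not taken from any of the 
toolkits named above. The function name extract_phrase_pairs, the 
max_len parameter, and the toy sentence pair with its links are all 
assumptions made up for illustration. It keeps a phrase pair only when no 
link leaves the pair on either side, and for simplicity it does not 
extend spans over unaligned words.

```python
def extract_phrase_pairs(src, tgt, links, max_len=3):
    """Return phrase pairs that are consistent with the word links.

    links is a set of (source_index, target_index) pairs and may be
    many to many. A pair of spans is kept only if every link touching
    the target span starts inside the source span.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_pos = [j for (i, j) in links if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject the span if a target word inside [j1, j2] is linked
            # to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in links):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Hypothetical toy example: "gave up" and "a abandonné" are linked many to many.
src = ["he", "gave", "up"]
tgt = ["il", "a", "abandonné"]
links = {(0, 0), (1, 1), (1, 2), (2, 1), (2, 2)}

for pair in extract_phrase_pairs(src, tgt, links):
    print(pair)
```

In this toy case neither "gave" nor "up" yields a phrase pair on its own, 
because each is linked to target words that are also linked to the other; 
only the two-word unit "gave up" can be extracted.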
>
>  
>
> Word alignment is, to some extent, term alignment, possibly alignment 
> of terms with blanks or placeholders, and possibly alignment to empty 
> words (just as we have sentence alignment to empty sentences).
>
> Aligning 100% of the words might be difficult or even impossible (for 
> compound verbs, for instance).
>
> Calling it ‘a joke’ (the inappropriate wording in my earlier post) can 
> be offensive to people working on word alignment.
>
> --the above adapted from Alberto Simões
>
>  
>
> Related implementations and literature
>
> The Mathematics of Statistical Machine Translation: Parameter Estimation.
>
> Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and 
> Robert L. Mercer. Computational Linguistics, 19(2):263–311. 1993.
>
>  
>
> The alignment template approach to statistical machine translation.
>
> Franz Josef Och and Hermann Ney. Computational Linguistics, 
> 30:417–449. 2004.
>
>  
>
> Jörg Tiedemann's book "Bitext Alignment", which is about to be 
> published (probably this week!) by Morgan & Claypool 
> (http://morganclaypool.com) in their HLT Synthesis series.  It 
> includes a 45-page chapter on word alignment. (provided by Graeme Hirst)
>
>  
>
> Word alignment implementations have been around for a while: GIZA++ 
> (http://code.google.com/p/giza-pp/) is the most used, but there are 
> other word aligners such as BerkeleyAligner 
> (http://code.google.com/p/berkeleyaligner).
>
>  
>
> GIZA++ implements the alignment models described in
>
> Och, Franz Josef, and Hermann Ney (2003) "A Systematic Comparison of 
> Various Statistical Alignment Models." Computational Linguistics 
> 29(1): 19-51. http://acl.ldc.upenn.edu/J/J03/J03-1002.pdf
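
For readers who have not met these models, the simplest of the family 
compared in that paper, IBM Model 1 from Brown et al. (1993), can be 
sketched in a few lines. This is an illustrative toy, not GIZA++'s code 
or interface: the names train_ibm1 and viterbi_align, the NULL token 
string, and the three-sentence corpus are assumptions for illustration. 
It estimates P(f | e) with EM and lets a word align to the empty word, as 
mentioned in the summary above.

```python
from collections import defaultdict

NULL = "NULL"  # empty word: lets an f word align to nothing on the e side


def train_ibm1(bitext, iterations=10):
    """EM for IBM Model 1. bitext is a list of (e_tokens, f_tokens).

    Returns t with t[(f, e)] approximating P(f | e).
    """
    f_vocab = {f for _, fs in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e generating anything
        for es, fs in bitext:
            es_null = [NULL] + es
            for f in fs:
                # E-step: share one unit of count for f among candidate e words.
                z = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise expected counts per e word.
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def viterbi_align(es, fs, t):
    """For each f word, pick the e word (or NULL) with the highest t."""
    candidates = [NULL] + es
    return [(f, max(candidates, key=lambda e: t.get((f, e), 0.0))) for f in fs]


# Hypothetical toy corpus.
bitext = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_ibm1(bitext)
print(viterbi_align(["the", "house"], ["das", "Haus"], t))
```

Model 1 ignores word order entirely; the later models in the paper add 
position and fertility information on top of the same lexical table.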
>
>  
>
> Dekai Wu’s "Alignment" chapter in the Handbook of Natural Language 
> Processing.  The chapter has been extensively revised for the new 
> second edition (2010), edited by N. Indurkhya & F.J. Damerau, Chapman 
> and Hall / CRC Press, pp.367-408. (It covers token vs segmental 
> alignments, at word, phrase/collocation, and sentence levels. Starting 
> from flat models, it progressively moves to compositional/hierarchical 
> models that can handle the sorts of constructions and idioms you are 
> thinking about, using biparsing with transduction grammars.)
>
>  
>
> Thanks go to all the participants in the discussion, which was 
> enlightening and informative indeed.
>
>  
>
>  
>
> Best wishes,
>
>
> Jiajin XU
>
> Beijing Foreign Studies University
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   
