[Corpora-List] Summary: lexicographic tools for parallel/comparable corpora

Fri Feb 23 12:20:49 UTC 2007

Here is a summary of responses to my question:
"I'm looking for information about tools for the lexicographic use of
parallel and comparable corpora."

Short summary:

First of all, there do not seem to be many lexicographic projects that use
parallel/comparable corpora. Raphael Salkie pointed me to the Dictionnaire
canadien bilingue, for which parallel corpora were used (back in 1996).
There are papers talking about the use of parallel/comparable corpora in
dictionary building (e.g. Corréard (2005) and Krishnamurthy (2005)) but
there are no projects mentioned explicitly. The main problem seems to be
the lack of "clean", suitable data in reasonable quantities (pointed out
by several people). Adam Kilgarriff and his team used monolingual corpora
and his SketchEngine for bilingual lexicography (English-Irish) (which is
a step towards using comparable corpora, I believe) but he points out that
"...  we're a fair way off from `bilingual word sketches' ...". Lieve
Macken reminded me that the topic is closely related to multi-lingual
terminology extraction and there is, of course, a rich literature about it
(some references below).

Here are some pointers I got about available tools:

ParaConc: a commercial parallel concordancer (athel.com)

There is an online implementation of the Vanilla aligner at
http://www2.lael.pucsp.br/corpora/alinhador/index.html and an online
parallel concordancer at
http://www2.lael.pucsp.br/corpora/parallelconc/index.html
(used by students)

Thomas Schmidt used a combination of a parallel concordancing tool and a
lexicographic annotation tool for the construction of a multilingual 
football dictionary (www.kicktionary.de)

The Finnish translation technology company Masterin has a bilingual term
extractor that builds a raw bilingual translation lexicon from translation
memory databases.
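
To make the idea of a "raw bilingual translation lexicon" a bit more
concrete, here is a minimal sketch of one common baseline: scoring word
pairs by how often they co-occur in aligned segments (Dice coefficient).
This is NOT Masterin's method (I do not know how their extractor works);
the function name dice_lexicon, the min_score threshold and the toy
segment pairs are my own assumptions for illustration only.

```python
from collections import Counter
from itertools import product


def dice_lexicon(pairs, min_score=0.5):
    """Build a raw lexicon from (source_segment, target_segment) pairs.

    Hypothetical helper: a plain co-occurrence baseline, not the
    algorithm of any particular tool.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word type once per segment (naive whitespace tokens).
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_freq.update(s_words)
        tgt_freq.update(t_words)
        joint.update(product(s_words, t_words))

    lexicon = {}
    for (s, t), n in joint.items():
        # Dice coefficient: 2 * joint count / (source count + target count)
        score = 2 * n / (src_freq[s] + tgt_freq[t])
        if score < min_score:
            continue
        best = lexicon.get(s)
        if best is None or score > best[1]:
            lexicon[s] = (t, score)  # keep the best-scoring candidate
    return lexicon


# Toy translation-memory segments (invented for this example).
segments = [
    ("the house", "das haus"),
    ("the car", "das auto"),
    ("a house", "ein haus"),
]
for word, (translation, score) in sorted(dice_lexicon(segments).items()):
    print(word, translation, round(score, 2))
```

On real translation memories one would at least need proper tokenization,
frequency cut-offs and some treatment of multi-word terms; the output of
such a baseline is only raw material for a lexicographer to check.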

Grigori Sidorov has a research tool that performs lexical-based alignment 
for English-Spanish parallel corpora.

CLaRK is an XML based system for corpora development with support for 
document synchronization to be used to navigate through parallel corpora. 
http://www.bultreebank.org/clark/

A web-based corpus interface: http://corpus.leeds.ac.uk/internet.html
(software available at http://csar.sourceforge.net/) - I'm not sure about 
its support for parallel and comparable corpora ...

From my own experience, I can add some more related tools:

various implementations of Gale & Church's sentence alignment algorithm
(e.g. http://nl.ijs.si/telri/Vanilla/),
Melamed's GMA (http://nlp.cs.nyu.edu/GMA/),
Hunalign (http://mokk.bme.hu/resources/hunalign),  
Champollion Tool Kit (http://champollion.sourceforge.net/)
Berger's align tool (http://www.cse.unt.edu/~rada/wa/tools/aberger/)
Moore's sentence aligner (http://research.microsoft.com/users/bobmoore/)
GIZA++
(http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html) 
Twente word aligner
(http://wwwhome.cs.utwente.nl/~irgroup/align/download.html) now in the
NATools package (http://natura.di.uminho.pt/natura/natura?&topic=NATools)
ILink (http://www.ida.liu.se/~nlplab/ILink/),
K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
CWB from IMS Stuttgart with support for aligned corpora
(http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/)
Uplug (http://sourceforge.net/projects/uplug)
... there are more tools for visualization and manual alignment ...
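
For readers who have not seen length-based sentence alignment before, here
is a rough sketch of the basic idea behind aligners in the Gale & Church
tradition: a dynamic program over "beads" (1-1, 2-1, 1-2, 1-0, 0-1) that
prefers pairings whose character lengths match. Note that this is a
simplified variant with a hand-picked cost, not the probabilistic model of
the original algorithm and not the code of any of the tools listed above;
the function name align_by_length and the penalty values are my own
assumptions.

```python
def align_by_length(src, tgt):
    """Align two lists of sentences using character lengths only.

    Returns a list of beads (source_indices, target_indices).
    Simplified sketch: the penalties below are illustrative assumptions.
    """
    # Allowed bead shapes (source sentences, target sentences) -> penalty.
    bead_penalty = {
        (1, 1): 0.0,
        (2, 1): 1.0,
        (1, 2): 1.0,
        (1, 0): 3.0,
        (0, 1): 3.0,
    }
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programming over all (i, j) prefix pairs.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in bead_penalty.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                # Relative length mismatch in [0, 1].
                mismatch = abs(len_s - len_t) / max(len_s, len_t, 1)
                total = cost[i][j] + mismatch + penalty
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)

    # Trace back the cheapest path from (n, m) to (0, 0).
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Invented example: the second source sentence is split in the target.
source = ["Hello world.", "This is a short test.", "Bye."]
target = ["Hallo Welt.", "Dies ist ein", "kurzer Test.", "Ade."]
for bead in align_by_length(source, target):
    print(bead)
```

The tools listed above go well beyond this, e.g. by using a proper
statistical model or lexical information in addition to sentence length.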

I probably forgot a lot of links (that's why I asked on the list) - feel 
free to remind me!

Some references to literature I got:

Corréard, M.-H. 2005. Bilingual Lexicography. In K. Brown (ed.)
Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 1, (Oxford:
Elsevier), 787-796.

Krishnamurthy, R.  2005. Corpus Lexicography. In K. Brown (ed.)
Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 3, (Oxford:
Elsevier), 250-254.

Roberts, R.P. 1996. Parallel Text Analysis and Bilingual
Lexicography. Available from http://www.dico.uottawa.ca/articles-fr.htm

I. Dan Melamed's 2001 book/dissertation "Empirical Methods for Exploiting
Parallel Texts", MIT Press. There is a lot more on his website
http://cs.nyu.edu/~melamed/ .

Dan Tufis, Ana Maria Barbu, Radu Ion, Extracting Multilingual Lexicons
from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2,
May 2004, Pages 163-189
(http://dx.doi.org/10.1023/B:CHUM.0000031172.03949.48) ISSN 0010-4817

Dan Tufis 'A cheap and fast way to build useful translation lexicons' in
Proceedings of the 19th International Conference on Computational
Linguistics, COLING2002,  Taipei, 25-30 August, 2002, pp. 1030-1036, ISBN 
1-55860-894

(more papers on Dan Tufis's homepage http://www.racai.ro/~tufis/)

Alexander Gelbukh and Grigori Sidorov. Alignment of Paragraphs in 
Bilingual Texts using Bilingual Dictionaries and Dynamic Programming. 
Lecture Notes in Computer Science, N 4225, Springer-Verlag, 2006, pp 
824-833.

Two links about "bilingual terminology extraction on comparable corpora":
acl.ldc.upenn.edu/P/P04/P04-1067.pdf
acl.ldc.upenn.edu/acl2003/iral/ps/Sadat.ps

Thanks for responses:

Marie-Paule Jacques <marie-paule.jacques at lipn.univ-paris13.fr>
Michael Barlow <mi.barlow at auckland.ac.nz>
Thomas Schmidt <thomas.schmidt at uni-hamburg.de>
Tony Berber Sardinha <tony4 at uol.com.br>
Mickel Grönroos <mickel.gronroos at masterin.com>
Grigori Sidorov <sidorov at cic.ipn.mx>
Raphael Salkie <R.M.Salkie at bton.ac.uk>
Dan Tufis <tufis at racai.ro>
Serge Sharoff <s.sharoff at leeds.ac.uk>
Adam Kilgarriff <adam at lexmasterclass.com>
Kiril Simov <kivs at bultreebank.org>
Alex Murzaku <lissus at gmail.com>
Lieve Macken <lieve.macken at hogent.be>

Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **
**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********