[Corpora-List] Summary: lexicographic tools for parallel/comparable corpora

Mon Feb 26 11:13:10 UTC 2007

Hello Joerg,
thank you for this useful summary.
I did not reply earlier to your question because I thought that there 
were some tools more specifically designed for lexicographers.
But it appears that most of the links you give concern generic tools for 
multilingual corpora handling.

Here is one more link: Alinea is a free aligner and parallel 
concordancer that was evaluated in the last Arcade 2 campaign.
It obtained results close to those of the best system (around 98% 
F-measure for European language pairs) and proved particularly robust, 
even for very "distant" language pairs (French-Chinese, French-Arabic, 
French-Farsi, etc.).
Alinea can handle POS-tagged texts, complex expression searches 
(regular expressions over tags and lemmas), word-to-word alignment, and 
bilingual lexicon extraction.

http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=27&Itemid=43
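
For readers unfamiliar with the F-measure quoted above, here is a minimal 
sketch in Python of how such a score can be computed for an aligner. It 
assumes that alignments are represented as sets of (source sentence id, 
target sentence id) links, and the function name and toy data are made up 
for illustration; the actual Arcade 2 scoring procedure may well differ in 
its details.

```python
def alignment_f_measure(reference, hypothesis):
    """Return (precision, recall, F-measure) for two collections of links.

    Each link is assumed to be a (source_id, target_id) pair; this
    representation is an illustrative assumption, not the Arcade 2 format.
    """
    ref = set(reference)
    hyp = set(hypothesis)
    correct = len(ref & hyp)  # links proposed by the system that are in the gold standard
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data: the system gets three of four links right
gold = {(1, 1), (2, 2), (3, 3), (4, 4)}
system = {(1, 1), (2, 2), (3, 4), (4, 4)}
p, r, f = alignment_f_measure(gold, system)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
# prints: precision=0.75 recall=0.75 F=0.75
```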

On the same site you can find:
- a review of links about tools: 
http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=23&Itemid=41
- links about corpora: 
http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=20&Itemid=36
- and links about sources of parallel texts: 
http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=22&Itemid=38

Best regards

Olivier

> Here is a summary of responses to my question:
> "I'm looking for information about tools for the lexicographic use of
> parallel and comparable corpora."
>
>
> Short summary:
>
> First of all there do not seem to be many lexicographic projects that use
> parallel/comparable corpora. Raphael Salkie pointed me to the Dictionnaire
> canadien bilingue for which parallel corpora were used (back in 1996).
> There are papers talking about the use of parallel/comparable corpora in
> dictionary building (e.g. Corréard (2005) and Krishnamurthy (2005)) but 
> there are no projects mentioned explicitly. The main problem seems to be 
> the lack of "clean", suitable data in reasonable quantities (pointed out 
> by several people). Adam Kilgarriff and his team used monolingual corpora 
> and his SketchEngine for bilingual lexicography (English-Irish) (which is 
> a step towards using comparable corpora I believe) but he points out that 
> "...  we're a fair way off from `bilingual word sketches' ...". Lieve 
> Macken reminded me that the topic is closely related to multilingual 
> terminology extraction, and there is, of course, a rich literature about it 
> (some references below).
>
>
>
> Here are some pointers I got about available tools:
>
>
> ParaConc: a commercial parallel concordancer (athel.com)
>
> There is an online implementation of the Vanilla-aligner at 
> http://www2.lael.pucsp.br/corpora/alinhador/index.html and an online 
> parallel concordancer at 
> http://www2.lael.pucsp.br/corpora/parallelconc/index.html
> used by students
>
> Thomas Schmidt used a combination of a parallel concordancing tool and a
> lexicographic annotation tool for the construction of a multilingual 
> football dictionary (www.kicktionary.de)
>
> The Finnish translation technology company Masterin has a bilingual term
> extractor that builds a raw bilingual translation lexicon from translation
> memory databases.
>
> Grigori Sidorov has a research tool that performs lexical-based alignment 
> for English-Spanish parallel corpora.
>
> CLaRK is an XML based system for corpora development with support for 
> document synchronization to be used to navigate through parallel corpora. 
> http://www.bultreebank.org/clark/
>
> A web-based corpus interface: http://corpus.leeds.ac.uk/internet.html
> (software available at http://csar.sourceforge.net/) - I'm not sure about 
> its support for parallel and comparable corpora ...
>
>
> From my own experience, I can add some more related tools:
>
> various implementations of Gale&Church's sentence alignment algorithm
> (e.g. http://nl.ijs.si/telri/Vanilla/),
> Melamed's GMA (http://nlp.cs.nyu.edu/GMA/), 
> Hunalign (http://mokk.bme.hu/resources/hunalign),  
> Champollion Tool Kit (http://champollion.sourceforge.net/)
> Berger's align tool (http://www.cse.unt.edu/~rada/wa/tools/aberger/)
> Moore's sentence aligner (http://research.microsoft.com/users/bobmoore/)
> GIZA++
> (http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html) 
> Twente word aligner
> (http://wwwhome.cs.utwente.nl/~irgroup/align/download.html) now in the
> NATools package (http://natura.di.uminho.pt/natura/natura?&topic=NATools)
> ILink (http://www.ida.liu.se/~nlplab/ILink/),
> K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
> CWB from IMS Stuttgart with support for aligned corpora
> (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/)
> Uplug (http://sourceforge.net/projects/uplug)
> ... there are more tools for visualization and manual alignment ...
>
> I probably forgot a lot of links (that's why I asked on the list) - feel 
> free to remind me!
>
>
>
> Some references to literature I got:
>
> Corréard, M.-H. 2005. Bilingual Lexicography. In K. Brown (ed.)
> Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 1, (Oxford:
> Elsevier), 787-796.
>
> Krishnamurthy, R.  2005. Corpus Lexicography. In K. Brown (ed.)
> Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 3, (Oxford:
> Elsevier), 250-254.
>
> Roberts, R.P. 1996. Parallel Text Analysis and Bilingual
> Lexicography. Available from http://www.dico.uottawa.ca/articles-fr.htm
>
> I. Dan Melamed's 2001 book/dissertation "Empirical Methods for Exploiting 
> Parallel Texts", MIT Press. There is a lot more on his website 
> http://cs.nyu.edu/~melamed/ .
>
> Dan Tufis, Ana Maria Barbu, Radu Ion, Extracting Multilingual Lexicons 
> from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2, 
> May 2004, Pages 163-189 
> (http://dx.doi.org/10.1023/B:CHUM.0000031172.03949.48) ISSN 0010-4817
>
> Dan Tufis 'A cheap and fast way to build useful translation lexicons' in
> Proceedings of the 19th International Conference on Computational
> Linguistics, COLING2002,  Taipei, 25-30 August, 2002, pp. 1030-1036, ISBN 
> 1-55860-894
>
> (more papers on Dan Tufis's homepage http://www.racai.ro/~tufis/)
>
> Alexander Gelbukh and Grigori Sidorov. Alignment of Paragraphs in 
> Bilingual Texts using Bilingual Dictionaries and Dynamic Programming. 
> Lecture Notes in Computer Science, N 4225, Springer-Verlag, 2006, pp 
> 824-833.
>
> two links about "bilingual terminology extraction on comparable corpora":
> acl.ldc.upenn.edu/P/P04/P04-1067.pdf
> acl.ldc.upenn.edu/acl2003/iral/ps/Sadat.ps
>
>
> Thanks for responses:
>
> Marie-Paule Jacques <marie-paule.jacques at lipn.univ-paris13.fr>
> Michael Barlow <mi.barlow at auckland.ac.nz>
> Thomas Schmidt <thomas.schmidt at uni-hamburg.de>
> Tony Berber Sardinha <tony4 at uol.com.br>
> Mickel Grönroos <mickel.gronroos at masterin.com>
> Grigori Sidorov <sidorov at cic.ipn.mx>
> Raphael Salkie <R.M.Salkie at bton.ac.uk>
> Dan Tufis <tufis at racai.ro>
> Serge Sharoff <s.sharoff at leeds.ac.uk>
> Adam Kilgarriff <adam at lexmasterclass.com>
> Kiril Simov <kivs at bultreebank.org>
> Alex Murzaku <lissus at gmail.com>
> Lieve Macken <lieve.macken at hogent.be>
>
>
>
>
> Jörg
>
> ***********/\/\/\/\/\/\/\/\/\/\/\************************************
> **  Jörg Tiedemann                 tiedeman at let.rug.nl             **
> **  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **
> **  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
> **  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
> **  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
> *************************************/\/\/\/\/\/\/\/\/\/\/\**********
>