[Corpora-List] Summary: lexicographic tools for parallel/comparable corpora

Fri Feb 23 13:38:08 UTC 2007

Dear Joerg

The Oxford-Hachette French Dictionary (1994) was "based on
two electronic text collections, one French and one English,
each containing over 10 million words" (cover flap).

Best
Ramesh

At 12:20 23/02/2007, Joerg Tiedemann wrote:

>Here is a summary of responses to my question:
>"I'm looking for information about tools for the lexicographic use of
>parallel and comparable corpora."
>
>
>Short summary:
>
>First of all, there do not seem to be many lexicographic projects that use
>parallel/comparable corpora. Raphael Salkie pointed me to the Dictionnaire
>canadien bilingue for which parallel corpora were used (back in 1996).
>There are papers talking about the use of parallel/comparable corpora in
>dictionary building (e.g. Corréard (2005) and Krishnamurthy (2005)) but
>there are no projects mentioned explicitly. The main problem seems to be
>the lack of "clean", suitable data in reasonable quantities (pointed out
>by several people). Adam Kilgarriff and his team used monolingual corpora
>and his SketchEngine for bilingual lexicography (English-Irish) (which is
>a step towards using comparable corpora I believe) but he points out that
>"...  we're a fair way off from `bilingual word sketches' ...". Lieve
>Macken reminded me that the topic is closely related to multi-lingual
>terminology extraction and there is, of course, a rich literature about it
>(some references below).
>
>
>
>Here are some pointers I got about available tools:
>
>
>ParaConc: a commercial parallel concordancer (athel.com)
>
>There is an online implementation of the Vanilla-aligner at
>http://www2.lael.pucsp.br/corpora/alinhador/index.html and an online
>parallel concordancer at
>http://www2.lael.pucsp.br/corpora/parallelconc/index.html
>used by students
>
>Thomas Schmidt used a combination of a parallel concordancing tool and a
>lexicographic annotation tool for the construction of a multilingual
>football dictionary (www.kicktionary.de)
>
>The Finnish translation technology company Masterin has a bilingual term
>extractor that builds a raw bilingual translation lexicon from translation
>memory databases.
>
>Grigori Sidorov has a research tool that performs lexical-based alignment
>for English-Spanish parallel corpora.
>
>CLaRK is an XML based system for corpora development with support for
>document synchronization to be used to navigate through parallel corpora.
>http://www.bultreebank.org/clark/
>
>A web-based corpus interface: http://corpus.leeds.ac.uk/internet.html
>(software available at http://csar.sourceforge.net/) - I'm not sure about
>its support for parallel and comparable corpora ...
>
>
>From my own experience, I can add some more related tools:
>
>various implementations of Gale&Church's sentence alignment algorithm
>(e.g. http://nl.ijs.si/telri/Vanilla/),
>Melamed's GMA (http://nlp.cs.nyu.edu/GMA/),
>Hunalign (http://mokk.bme.hu/resources/hunalign),
>Champollion Tool Kit (http://champollion.sourceforge.net/)
>Berger's align tool (http://www.cse.unt.edu/~rada/wa/tools/aberger/)
>Moore's sentence aligner (http://research.microsoft.com/users/bobmoore/)
>GIZA++
>(http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html)
>Twente word aligner
>(http://wwwhome.cs.utwente.nl/~irgroup/align/download.html) now in the
>NATools package (http://natura.di.uminho.pt/natura/natura?&topic=NATools)
>ILink (http://www.ida.liu.se/~nlplab/ILink/),
>K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
>CWB from IMS Stuttgart with support for aligned corpora
>(http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/)
>Uplug (http://sourceforge.net/projects/uplug)
>... there are more tools for visualization and manual alignment ...
>
>I probably forgot a lot of links (that's why I asked on the list) - feel
>free to remind me!
>
>
>
>Some references to literature I got:
>
>Corréard, M.-H. 2005. Bilingual Lexicography. In K. Brown (ed.)
>Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 1, (Oxford:
>Elsevier), 787-796.
>
>Krishnamurthy, R.  2005. Corpus Lexicography. In K. Brown (ed.)
>Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 3, (Oxford:
>Elsevier), 250-254.
>
>Roberts, R.P. 1996. Parallel Text Analysis and Bilingual
>Lexicography. Available from http://www.dico.uottawa.ca/articles-fr.htm
>
>I. Dan Melamed's 2001 book/dissertation "Empirical Methods for Exploiting
>Parallel Texts", MIT Press. There is a lot more on his website
>http://cs.nyu.edu/~melamed/ .
>
>Dan Tufis, Ana Maria Barbu, Radu Ion, Extracting Multilingual Lexicons
>from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2,
>May 2004, Pages 163-189
>(http://dx.doi.org/10.1023/B:CHUM.0000031172.03949.48) ISSN 0010-4817
>
>Dan Tufis 'A cheap and fast way to build useful translation lexicons' in
>Proceedings of the 19th International Conference on Computational
>Linguistics, COLING2002,  Taipei, 25-30 August, 2002, pp. 1030-1036, ISBN
>1-55860-894
>
>(more papers on Dan Tufis's homepage http://www.racai.ro/~tufis/)
>
>Alexander Gelbukh and Grigori Sidorov. Alignment of Paragraphs in
>Bilingual Texts using Bilingual Dictionaries and Dynamic Programming.
>Lecture Notes in Computer Science, N 4225, Springer-Verlag, 2006, pp
>824-833.
>
>Two links about "bilingual terminology extraction on comparable corpora":
>acl.ldc.upenn.edu/P/P04/P04-1067.pdf
>acl.ldc.upenn.edu/acl2003/iral/ps/Sadat.ps
>
>
>Thanks for responses:
>
>Marie-Paule Jacques <marie-paule.jacques at lipn.univ-paris13.fr>
>Michael Barlow <mi.barlow at auckland.ac.nz>
>Thomas Schmidt <thomas.schmidt at uni-hamburg.de>
>Tony Berber Sardinha <tony4 at uol.com.br>
>Mickel Grönroos <mickel.gronroos at masterin.com>
>Grigori Sidorov <sidorov at cic.ipn.mx>
>Raphael Salkie <R.M.Salkie at bton.ac.uk>
>Dan Tufis <tufis at racai.ro>
>Serge Sharoff <s.sharoff at leeds.ac.uk>
>Adam Kilgarriff <adam at lexmasterclass.com>
>Kiril Simov <kivs at bultreebank.org>
>Alex Murzaku <lissus at gmail.com>
>Lieve Macken <lieve.macken at hogent.be>
>
>
>
>
>Jörg
>
>***********/\/\/\/\/\/\/\/\/\/\/\************************************
>**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
>**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **
>**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
>**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
>**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
>*************************************/\/\/\/\/\/\/\/\/\/\/\**********

Ramesh Krishnamurthy

Lecturer in English Studies, School of Languages 
and Social Sciences, Aston University, Birmingham B4 7ET, UK
[Room NX08, North Wing of Main Building] ; Tel: 
+44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/