[Corpora-List] Summary - sentence aligner script

Tony Berber Sardinha tony4 at uol.com.br
Tue Dec 17 16:09:11 UTC 2002


Dear list members

Thanks to all who replied to my query about sentence aligner scripts:
Susan Armstrong, Torgny Rasmark, Jean Veronis, François Maniez, Tomaz Erjavec,
Marco Baroni.
Below are the replies that I got:

>>Susan Armstrong

We have a publicly available aligner, made for an EU project some years
ago, available at http://www.issco.unige.ch/tools/

>>Torgny Rasmark

Vanilla aligner (for DOS):
http://spraakbanken.gu.se/lb/English/downloads.html

>>Jean Veronis

see:

Gale, W., and Church, K. (1993) "A Program for Aligning Sentences in
Bilingual Corpora," Computational Linguistics, 19:1, pp. 75-102.

There is a C program published at the end of the paper. It is available
from Ken's page at:

http://www.research.att.com/~kwc/publications.html

>>François Maniez

Hello,

this is not about perl or Unix, but I have written a Word macro that does
the trick if the original format of your data is an x-column table where x
is the number of languages included in your parallel corpus (I am currently
building a medical corpus from files available on the European Commission
website in English, French, German, Italian, Spanish and Portuguese, in
order to test terminological extraction algorithms).

The output of the macro needs to be manually corrected, as one sentence will
occasionally be translated as two sentences, and vice versa.
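
The manual correction step described above could be narrowed down by flagging
only the table rows where the languages disagree on sentence count. The
following is a minimal Python sketch of that idea, not the Word macro itself.
It assumes the table has already been exported as a list of rows, one cell per
language; the function names and the naive punctuation-based sentence counter
are hypothetical, and abbreviations such as "e.g." will be over-counted.

```python
import re

# Naive sentence boundary (an assumption): a run of ., ! or ?
# followed by whitespace or the end of the cell.
_SENT_END = re.compile(r"[.!?]+(?=\s+|$)")


def count_sentences(cell):
    """Roughly count the sentences in one table cell."""
    text = cell.strip()
    if not text:
        return 0
    # Text with no terminal punctuation still counts as one sentence.
    return max(len(_SENT_END.findall(text)), 1)


def rows_to_check(rows):
    """Return (row index, per-language counts) for rows whose cells
    do not all contain the same number of sentences."""
    flagged = []
    for i, row in enumerate(rows):
        counts = [count_sentences(cell) for cell in row]
        if len(set(counts)) > 1:
            flagged.append((i, counts))
    return flagged


if __name__ == "__main__":
    # Hypothetical two-language example (English, French).
    table = [
        ["The patient was admitted.", "Le patient a été admis."],
        ["He recovered quickly and was discharged.",
         "Il s'est vite rétabli. Il est sorti."],
    ]
    for index, counts in rows_to_check(table):
        print("row", index, "sentence counts:", counts)
```

Only the flagged rows would then need to be inspected by hand; rows with equal
counts can still be misaligned, so this reduces the checking rather than
replacing it.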

Let me know if you're interested, and I'll send it along.

Cheers,

François MANIEZ
Maître de Conférences
Centre de Recherche en Terminologie et en Traduction
Département de Langues Étrangères Appliquées
Université Lumière Lyon 2
maniezf at univ-lyon2.fr
fmaniez at wanadoo.fr
http://nte.univ-lyon2.fr/~maniezf/recherche.html

>>Tomaz Erjavec

Hi,
Vanilla can also be found at
http://nl.ijs.si/telri/Vanilla/
complete with an accompanying paper and free to download!
Best,
Tomaz

>>Marco Baroni

Hi!

There is a version of the Vanilla aligner, pre-compiled for DOS, on the
following site:

http://spraakbanken.gu.se/lb/downloads.html

It is possible to download a compressed archive from there, but, as I
don't understand Swedish (assuming it is Swedish...), I don't know if
there are any restrictions on its use.

Also, if you go to Kenneth Church's publications page, you can download the
text version of

Gale, W., and Church, K. (1993) "A Program for Aligning
Sentences in Bilingual Corpora," Computational Linguistics, 19:1, pp.
75-102

which contains the source code for their famous aligner as an appendix.

Regards,

Marco Baroni


cheers
tony.
-------------------------------------
Dr Tony Berber Sardinha
LAEL, PUC/SP
(Catholic University of Sao Paulo, Brazil)
tony4 at uol.com.br
http://lael.pucsp.br/~tony
[New website]


