[Corpora-List] Some comments on aligners

Thu Sep 5 11:04:31 UTC 2002

Dear colleagues,

It seems to me something of a waste of time and resources to be discussing
aligners for one particular commercial application such as ParaConc on this
list (I know that was the initial question...), given that there are so many
other systems that may offer better search functionality for parallel
corpora and which are, moreover, free and already available.

So, after some reflection, I decided to describe our approach in COMPARA,
so that naive readers of the list do not conclude that the only existing
aligners are the ones discussed in the previous mail thread. Basically, I
suggest that anyone involved in parallel corpora work use

1) the IMS Corpus Workbench developed at Stuttgart (Stefan Evert and Ulrich
Heid)
2) and the EasyAlign aligner that comes with it, which has all the
functionality described in the previous mails (namely, it aligns, or
accepts a previous alignment, so that one can easily incorporate the
results of manual revision into a powerful corpus querying system)

For those who would complain that the system runs on Unix / Linux and is
therefore not usable by naive users, the obvious solution is to create a
Web frontend, as we did in COMPARA; see http://www.portugues.mct.pt/COMPARA

I'm not paid to advertise IMS-CWB, nor to align texts for other projects
(although we do it occasionally for some people when one of the languages
of the parallel texts is Portuguese), but I really think, after careful
consideration of many other systems and approaches, that this is the best
way to go.

People interested in the technical details of exactly how the DISPARA setup
works can also read, after the Web pages, the paper

Santos, Diana. "DISPARA, a system for distributing parallel corpora on the
Web", in Elisabete Ranchhod & Nuno J. Mamede (eds.), Advances in Natural
Language Processing (Third International Conference, PorTAL 2002, Faro,
Portugal, June 2002, Proceedings), LNAI 2389, Springer, 2002, pp.209-218.

and here is a gentler presentation for non-technical users

Frankenberg-Garcia, Ana & Diana Santos. "Introducing COMPARA, the
Portuguese-English parallel translation corpus", paper presented at
CULT'2000, to appear in a volume of selected contributions, St.Jerome,
http://www.linguateca.pt/Diana/download/FrankenbergSantos.rtf
http://www.linguateca.pt/Diana/download/FrankenbergSantos.ps

The service we occasionally provide (NB! only when one of the languages is
Portuguese!!! -- to be fair, we have so far only tried with
English-Portuguese and Norwegian-Portuguese pairs...) is to accept texts in
text-only format (e.g., TEXT1.po and TEXT1.en) already aligned by paragraph
(this means one paragraph per line in each text), submit them to EasyAlign
and send the output back sentence-aligned. (Paragraphs can of course be
titles or other things.) I've prepared an example of text input and text
output for those interested in the service at
http://acdc.linguateca.pt/example_alignment.html. (Note that it has to
involve Portuguese as one of the languages.)
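
For anyone preparing such a pair of files, the "one paragraph per line in
each text" requirement can be checked before sending. What follows is only a
minimal sketch in Python, not part of EasyAlign or the IMS-CWB: the function
names are hypothetical, the UTF-8 default encoding is an assumption (pass
another one if your files use it), and treating a line that is empty on only
one side as an error is my own guess at what would break the alignment.

```python
from pathlib import Path


def read_paragraphs(path, encoding="utf-8"):
    """Read a text file, assuming one paragraph per line.

    The encoding default is an assumption; older files may need "latin-1".
    """
    return Path(path).read_text(encoding=encoding).splitlines()


def check_paragraph_alignment(source_paragraphs, target_paragraphs):
    """Return a list of problems found in a paragraph-aligned text pair.

    An empty list means both texts have the same number of lines and no
    line is empty on one side only (a hypothetical rule, see lead-in).
    """
    problems = []
    if len(source_paragraphs) != len(target_paragraphs):
        problems.append(
            "paragraph counts differ: "
            f"{len(source_paragraphs)} vs {len(target_paragraphs)}"
        )
    # zip stops at the shorter text, so only line numbers present in
    # both files are compared here.
    pairs = zip(source_paragraphs, target_paragraphs)
    for number, (source, target) in enumerate(pairs, start=1):
        if bool(source.strip()) != bool(target.strip()):
            problems.append(f"line {number}: paragraph empty on one side only")
    return problems


if __name__ == "__main__":
    # File names follow the TEXT1.po / TEXT1.en example in the text above.
    for problem in check_paragraph_alignment(
        read_paragraphs("TEXT1.po"), read_paragraphs("TEXT1.en")
    ):
        print(problem)
```

If the script prints nothing, the two files at least have matching line
counts; whether each line really is the translation of its counterpart
still has to be checked by eye.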

However, I would warmly encourage people to actually use the IMS-CWB
themselves and create their own Web services. The advantages of its query
power (also for translation corpora) are tremendous.

Diana
************************************************************************
Diana Santos			Computational processing of Portuguese

SINTEF Telecom & Informatics	Tel. (direct line) +47 22 06 73 12
Forskningsveien 1			Tel. +47 22 06 73 00
Box 124 Blindern			Fax. +47 22 06 73 50
N-0314 Oslo				Email: Diana.Santos at sintef.no
Norway				http://www.portugues.mct.pt/
************************************************************************