[Corpora-List] Some comments on aligners

Ute Römer ute.roemer at uni-koeln.de
Sat Sep 7 09:45:41 UTC 2002


Dear all,

Some weekend thoughts on Corpora List discussions -- in reply to Diana
Santos' recent posting.

I was just wondering, is it really "a waste of time" to discuss -- on an
email list the purpose of which it is, or ought to be, to exchange ideas on
certain specific topics and to help people solve corpus linguistic
problems -- special software tools, their use, and problems you encounter
while using them? And does it make a difference then whether the tools in
question are freely available or not? What's wrong with explicitly asking
for help with a certain program like Sampo Nevalainen did? I actually do not
very much like the idea of having to think twice before sending queries on
commercially available corpora and corpus analysis tools to the list and I
suspect that other list members might feel the same.

Have a good weekend, all of you!

Best,
Ute


----- Original Message -----
From: "Santos Diana" <Diana.Santos at sintef.no>
To: <corpora at hd.uib.no>
Sent: Thursday, September 05, 2002 1:04 PM
Subject: [Corpora-List] Some comments on aligners


> Dear colleagues,
>
> It seems to me somewhat a waste of time and resources to be discussing
> aligners for a particular commercial application such as ParaConc on this
> list (I know that was the initial question...), given that there are so
> many other systems that may offer better search functionality for
> parallel corpora and which are, moreover, free and already available.
>
> So, after some reflection, I decided, to prevent some naive readers of the
> list from concluding that the only existing aligners were the ones
> discussed in the previous mail thread, to talk about our approach in
> COMPARA, basically to suggest that anyone involved in parallel corpora
> work use
>
> 1) the IMS Corpus Workbench developed at Stuttgart (Stefan Evert and
> Ulrich Heid)
> 2) and the EasyAlign aligner that comes with it and has all the
> functionalities that have been described in the previous mails (namely, it
> aligns, or accepts a previous alignment, so that one can easily
> incorporate the results of manual revision into a powerful corpus
> querying system)
>
> For those who would complain that the system runs on Unix / Linux and is
> therefore not usable for naive users, the obvious solution is to create a
> Web frontend as we did in COMPARA, see http://www.portugues.mct.pt/COMPARA
>
> I'm not paid to advertise IMS-CWB nor to align texts for
> other projects (although we do it occasionally for some people when one of
> the languages of the parallel texts is Portuguese), but I really think,
> after careful consideration of many other systems and approaches, that
> this is the best way to go.
>
> People interested in the technical details of exactly how the DISPARA setup
> works can also read, after the Web pages, the paper
>
> Santos, Diana. "DISPARA, a system for distributing parallel corpora on the
> Web", in Elisabete Ranchhod & Nuno J. Mamede (eds.), Advances in Natural
> Language Processing (Third International Conference, PorTAL 2002, Faro,
> Portugal, June 2002, Proceedings), LNAI 2389, Springer, 2002, pp. 209-218.
>
> and here is a soft presentation for non-technical users
>
> Frankenberg-Garcia, Ana & Diana Santos. "Introducing COMPARA, the
> Portuguese-English parallel translation corpus", paper presented at
> CULT'2000, to appear in a volume of selected contributions, St. Jerome,
> http://www.linguateca.pt/Diana/download/FrankenbergSantos.rtf
> http://www.linguateca.pt/Diana/download/FrankenbergSantos.ps
>
> The service we occasionally provide (NB! only when one of the languages is
> Portuguese!!! -- to be fair, we have so far only tried with
> English-Portuguese and Norwegian-Portuguese pairs...) is to accept texts
> in text-only format (e.g., TEXT1.po and TEXT1.en) already aligned by
> paragraph (this means one paragraph per line in each text), submit them to
> EasyAlign, and send the output back sentence-aligned. (Paragraphs can of
> course be titles or other things.) I've prepared an example of text input
> and text output for those interested in the service at
> http://acdc.linguateca.pt/example_alignment.html. (Note that it has to
> involve Portuguese as one of the languages.)
>
> However, I would warmly encourage people to actually use the IMS-CWB
> themselves and create their own Web services. The advantages of using the
> query power (also in translation corpora) are tremendous.
>
> Diana
> ************************************************************************
> Diana Santos Computational processing of Portuguese
>
> SINTEF Telecom & Informatics Tel. (direct line) +47 22 06 73 12
> Forskningsveien 1 Tel. +47 22 06 73 00
> Box 124 Blindern Fax. +47 22 06 73 50
> N-0314 Oslo Email: Diana.Santos at sintef.no
> Norway http://www.portugues.mct.pt/
> ************************************************************************
>
>
>
>
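
The input format described in the quoted message (two text-only files,
one paragraph per line, so that line N of one file corresponds to line N
of the other) can be sanity-checked before anything is sent to an
aligner. What follows is only a minimal sketch of such a check. The
function names and the UTF-8 encoding are assumptions made for
illustration, and the script is not part of EasyAlign, IMS-CWB, or the
COMPARA service.

```python
from pathlib import Path


def read_paragraphs(path, encoding="utf-8"):
    """Read a text-only file in which every line holds one paragraph.

    The encoding is an assumption; older corpus files may need "latin-1".
    """
    return Path(path).read_text(encoding=encoding).splitlines()


def check_paragraph_alignment(source_paragraphs, target_paragraphs):
    """Return a list of problems; an empty list means the pair looks aligned.

    Only the shape of the input is checked (same number of lines, no
    paragraph that is empty on one side only), not the translation itself.
    """
    problems = []
    if len(source_paragraphs) != len(target_paragraphs):
        problems.append(
            "paragraph counts differ: "
            f"{len(source_paragraphs)} vs {len(target_paragraphs)}"
        )
    pairs = zip(source_paragraphs, target_paragraphs)
    for number, (source, target) in enumerate(pairs, start=1):
        # A paragraph that is blank on one side only suggests a shifted line.
        if bool(source.strip()) != bool(target.strip()):
            problems.append(f"line {number}: one side is empty")
    return problems
```

For a pair named as in the message, the check could be run as
check_paragraph_alignment(read_paragraphs("TEXT1.po"),
read_paragraphs("TEXT1.en")), and any reported problem fixed by hand
before submission.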


