[Corpora-List] Some comments on aligners

Santos Diana Diana.Santos at sintef.no
Sat Sep 7 14:52:38 UTC 2002


Dear Ute and Sampo, and corpora-list members in general,

I believe my message was doubly misunderstood.

About Ute's remark: 

I was not criticizing Sampo's question about a particular commercial
aligner; I was suggesting that the answers to the question were more
encompassing - in fact Sampo's message later mentioned he had after all
found another program (so he was not THAT concerned with ParaConc after
all). That's why I posted my message: to give other people on the list the
idea that there are other more powerful tools out there. It was not to
criticize people for asking specific questions.

About Sampo's answer, and this is where I'm most sorry for not having been
understood, I was not discussing the use of IMS-CWB for large projects,
where I think its advantages are uncontroversial.

On the contrary, I was suggesting (and was actually hoping to have
demonstrated) that it is also the best way to proceed in cases like the
ones discussed by Sampo, where a few students work with their own corpora.

I tried to explain that it is easy to set up a Web service that aligns
texts for the users, lets them revise the alignment if needed, and then
displays the result in an easy, user-friendly, and platform-independent way
through a Web interface to the IMS-CWB, as we do for COMPARA.

I was not suggesting that everyone who wanted to look at parallel corpora
had to copy and devise a system such as COMPARA, which was designed from the
beginning for a wide range of users and to be made publicly available.

Rather, I was suggesting another kind of service (incidentally, that we are
also planning to offer at Linguateca, a distributed resource center for
Portuguese, at the forthcoming pole in Porto, with Belinda Maia), namely,
the possibility of having different students and researchers working on
their own corpora with a common infrastructure.

If you have ONE student, it may be just as much work for you to tell him to go
and fetch a Windows-based program with limited functionality, etc. But if
you have more than one student or user, it would be to your advantage that
all of them use the same tools and input the texts the same way so that you
could even reuse (or at least look at) the texts they are using, all with
the same Web functionality. (Even if for copyright reasons it would have to
be password protected, that is a straightforward matter...)

So, that was what I was proposing: set up a simple service based on IMS-CWB
that aligns the texts and displays them through a Web interface, which the
users can then access from anywhere. (Then it would be up to you to define what is
a "concordancer that would be relatively simple in use and not too picky
with texts to be used as a corpus". My experience is that the second
criterion is already met by the IMS-CWB, for we have used large amounts of
all kinds of Portuguese text in the AC/DC project,
http://acdc.linguateca.pt/acesso/info_acesso_English.html.)
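
A minimal sketch of the kind of input check such a service might run before
alignment, assuming the plain-text convention described in the earlier
message quoted below (one paragraph per line, in files such as TEXT1.po and
TEXT1.en). The function names are hypothetical and are not part of IMS-CWB
or EasyAlign; ignoring blank lines is also an assumption.

```python
def read_paragraphs(text):
    """Split text into paragraphs, assuming one paragraph per line.

    Blank lines are ignored (an assumption, not a documented rule).
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def pair_paragraphs(source_text, target_text):
    """Pair paragraph i of the source with paragraph i of the target.

    Raises ValueError if the paragraph counts differ, because the two
    texts are then not paragraph-aligned and should be fixed by hand
    before being submitted to a sentence aligner.
    """
    source = read_paragraphs(source_text)
    target = read_paragraphs(target_text)
    if len(source) != len(target):
        raise ValueError(
            f"not paragraph-aligned: {len(source)} source paragraphs "
            f"vs {len(target)} target paragraphs"
        )
    return list(zip(source, target))
```

The two arguments would typically be the contents of the two files, read
with the same encoding; the resulting pairs are only a sanity check and do
not replace the sentence alignment itself.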

I won't be bothering the list with further technical details...

Thank you Ute and Sampo for your answers so that I could have another go at
this subject :-)
Diana

> -----Original Message-----
> From: Ute Römer [mailto:ute.roemer at uni-koeln.de]
> Sent: 7. september 2002 11:46
> To: corpora at hd.uib.no
> Subject: Re: [Corpora-List] Some comments on aligners
> 
> 
> Dear all,
> 
> Some weekend thoughts on Corpora List discussions -- in reply to Diana
> Santos' recent posting.
> 
> I was just wondering, is it really "a waste of time" to discuss -- on an
> email list the purpose of which it is, or ought to be, to exchange ideas on
> certain specific topics and to help people solve corpus linguistic
> problems -- special software tools, their use, and problems you encounter
> while using them? And does it make a difference then whether the tools in
> question are freely available or not? What's wrong with explicitly asking
> for help with a certain program like Sampo Nevalainen did? I actually do not
> very much like the idea of having to think twice before sending queries on
> commercially available corpora and corpus analysis tools to the list and I
> suspect that other list members might feel the same.
> 
> Have a good weekend all of you!
> 
> Best,
> Ute
> 
> 
> ----- Original Message -----
> From: "Santos Diana" <Diana.Santos at sintef.no>
> To: <corpora at hd.uib.no>
> Sent: Thursday, September 05, 2002 1:04 PM
> Subject: [Corpora-List] Some comments on aligners
> 
> 
> > Dear colleagues,
> >
> > It sounds to me somehow a waste of time and resources to be discussing
> > aligners for a particular commercial application such as ParaConc in this
> > list (I know that was the initial question...), given that there are so
> > many other systems that may cater for better functionalities of search in
> > parallel corpora and which are moreover free and already existing.
> >
> > So, after some reflection, I decided, to prevent some naive readers of the
> > list to conclude that the only existing aligners were the ones discussed
> > in the previous mail thread, to talk about our approach in COMPARA,
> > basically to suggest to anyone involved in parallel corpora work to use
> >
> > 1) the IMS Corpus Workbench developed at Stuttgart (Stefan Evert and
> > Ulrich Heid)
> > 2) and the EasyAlign aligner that comes with it and has all the
> > functionalities that have been described in the previous mails (namely it
> > aligns, or accepts a previous alignment, so that one can easily
> > incorporate the results of manual revision into a powerful corpus
> > querying system)
> >
> > For those that would complain that the system is in Unix / Linux and
> > therefore not usable for naive users, the obvious solution is to create a
> > Web frontend as we did in COMPARA, see http://www.portugues.mct.pt/COMPARA
> >
> > I'm not paid to make any advertisements to IMS-CWB nor to align texts for
> > other projects (although we do it occasionally for some people when one
> > of the languages of the parallel texts is Portuguese), but I really think
> > after careful consideration of many other systems and approaches that
> > this is the best way to go.
> >
> > People interested in technical details of exactly how the DISPARA setup
> > works can read as well, after the Web pages, the paper
> >
> > Santos, Diana. "DISPARA, a system for distributing parallel corpora on
> > the Web", in Elisabete Ranchhod & Nuno J. Mamede (eds.), Advances in
> > Natural Language Processing (Third International Conference, PorTAL 2002,
> > Faro, Portugal, June 2002, Proceedings), LNAI 2389, Springer, 2002,
> > pp.209-218.
> >
> > and here is a soft presentation for non-technical users
> >
> > Frankenberg-Garcia, Ana & Diana Santos. "Introducing COMPARA, the
> > Portuguese-English parallel translation corpus", paper presented at
> > CULT'2000, to appear in a volume of selected contributions, St.Jerome,
> > http://www.linguateca.pt/Diana/download/FrankenbergSantos.rtf
> > http://www.linguateca.pt/Diana/download/FrankenbergSantos.ps
> >
> > The service we occasionally do (NB! only when one of the languages is
> > Portuguese!!! -- to be fair, we have so far only tried with
> > English-Portuguese and Norwegian-Portuguese pairs...) is to accept texts
> > in text-only format (eg, TEXT1.po and TEXT1.en) already aligned by
> > paragraph (this means one paragraph per line in each text), submit them
> > to EasyAlign and send the output back sentence aligned. (Paragraphs can
> > of course be titles or other things.) I've prepared an example of text
> > input and text output for those interested in the service in
> > http://acdc.linguateca.pt/example_alignment.html. (Note that it has to
> > involve Portuguese as one of the languages)
> >
> > However, I would warmly encourage people to actually use the IMS-CWB
> > themselves and create their own Web services. The advantages of using the
> > query power (also in translation corpora) are tremendous.
> >
> > Diana
> > **********************************************************************
> > Diana Santos                  Computational processing of Portuguese
> >
> > SINTEF Telecom & Informatics  Tel. (direct line) +47 22 06 73 12
> > Forskningsveien 1             Tel. +47 22 06 73 00
> > Box 124 Blindern              Fax. +47 22 06 73 50
> > N-0314 Oslo                   Email: Diana.Santos at sintef.no
> > Norway                        http://www.portugues.mct.pt/
> > **********************************************************************