[Corpora-List] RE: Evaluating Sentence Aligners
Olivier Kraif
olivier.kraif at tele2.fr
Mon Nov 19 15:33:40 UTC 2007
Dear Eric,
sorry for this very late answer; I hope it does not come too late for
you. I participated in the Arcade2 campaign with a system called
Alinea
(http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=27&Itemid=43).
Evaluating such systems is quite complex, since you have to take
various aspects into account: what kind of input do they process? Do
they need specific pre-processing? Do they need complex manual tuning
of parameters? Do they require specific linguistic data (such as
bilingual lexicons, transliteration tables, etc.)? Are the results
stable across different types of corpora?
Alinea was evaluated on 'distant' language pairs and yielded very
stable results (an average F-measure of 87% when aligning French with
non-Latin-script languages such as Arabic, Greek, Persian, Russian and
Chinese) on already segmented corpora. Other systems suffered a drastic
degradation of their results on some language pairs, but Alinea did not.
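For what it is worth, this kind of link-based scoring is easy to
reproduce. Below is a minimal Python sketch (my own illustration, not
the actual Arcade2 scorer) of precision, recall and F-measure over
alignment links against a reference alignment; keep in mind that the
figures depend on the chosen granularity (links, sentences or
characters).

    # Minimal sketch: precision/recall/F-measure over alignment links,
    # where each link pairs a tuple of source sentence ids with a tuple
    # of target sentence ids, e.g. (("s2", "s3"), ("t2",)) for a 2-1 link.
    def link_f_measure(predicted, gold):
        """Compare two sets of alignment links; return (precision, recall, F1)."""
        predicted, gold = set(predicted), set(gold)
        correct = len(predicted & gold)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return precision, recall, f1

    # Example: a gold 2-1 link mis-predicted as two 1-1 links
    gold = {(("s1",), ("t1",)), (("s2", "s3"), ("t2",))}
    pred = {(("s1",), ("t1",)), (("s2",), ("t2",)), (("s3",), ("t2",))}
    print(link_f_measure(pred, gold))   # roughly (0.33, 0.5, 0.4)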
These results were obtained without fine-tuning of parameters, and they
show that surface clues (sentence lengths and even identical character
strings) can be useful, even for unrelated languages that do not share
the same alphabet. Alinea can improve on these results when you add
specific linguistic data (a bilingual lexicon, transliteration tables)
or use a large parallel corpus (Alinea can be trained by automatically
extracting lexical correspondences from the raw aligned corpus).
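To make the point about surface clues concrete, here is a toy
length-based aligner in Python, in the spirit of Gale & Church-style
dynamic programming over sentence lengths. It is only a sketch under
simplified assumptions (a crude length-difference cost instead of a
real probabilistic model) and is not Alinea's implementation:

    # Toy length-based sentence aligner: dynamic programming over sentence
    # lengths with 1-1, 1-0, 0-1, 2-1 and 1-2 beads.
    def length_cost(src_len, tgt_len):
        """Penalty for pairing src_len source chars with tgt_len target chars."""
        if src_len == 0 and tgt_len == 0:
            return 0.0
        return abs(src_len - tgt_len) / max(src_len, tgt_len, 1)

    def align_by_length(src_sents, tgt_sents, skip_penalty=1.5):
        """Return a list of (src_indices, tgt_indices) beads minimizing total cost."""
        s = [len(x) for x in src_sents]
        t = [len(x) for x in tgt_sents]
        n, m = len(s), len(t)
        INF = float("inf")
        cost = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
        for i in range(n + 1):
            for j in range(m + 1):
                if cost[i][j] == INF:
                    continue
                for di, dj in beads:
                    ni, nj = i + di, j + dj
                    if ni > n or nj > m:
                        continue
                    c = length_cost(sum(s[i:ni]), sum(t[j:nj]))
                    if di == 0 or dj == 0:
                        c += skip_penalty  # discourage insertions/deletions
                    if cost[i][j] + c < cost[ni][nj]:
                        cost[ni][nj] = cost[i][j] + c
                        back[ni][nj] = (i, j, di, dj)
        # Trace back the best path into alignment beads
        result, i, j = [], n, m
        while i > 0 or j > 0:
            pi, pj, di, dj = back[i][j]
            result.append((tuple(range(pi, pi + di)), tuple(range(pj, pj + dj))))
            i, j = pi, pj
        return list(reversed(result))

    src = ["C'est une phrase.", "En voici une autre, un peu plus longue."]
    tgt = ["This is a sentence.", "Here is another one, a bit longer."]
    print(align_by_length(src, tgt))   # expect [((0,), (0,)), ((1,), (1,))]

In practice a probabilistic cost and lexical anchors work much better,
but even this crude version illustrates why length alone carries a lot
of signal.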
The Arcade 2 campaign also made a very important (and rather obvious)
point: the results of an aligner are closely related to the way the
texts are segmented. If the segmentation in the source and target
languages is similar, Alinea will behave very well; if it is not, you
may have a lot of manual correction to do afterwards. Thus, as the
segmentation rules may depend on the peculiarities of your corpus, it
may be wise to segment upstream, using a sentence splitter adapted to
your data. Since Alinea's own sentence splitter is designed for
languages written in the Latin alphabet, you should check whether the
sentence counts in the two languages are comparable (within +/-30%,
say) before launching the alignment process, as in the sketch below.
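Such a check takes only a few lines of Python (a hypothetical helper,
not part of Alinea):

    # Quick sanity check before launching the aligner: warn if source and
    # target sentence counts differ by more than 30%.
    def counts_comparable(n_src, n_tgt, tolerance=0.30):
        """True if the two sentence counts are within the given relative tolerance."""
        return abs(n_src - n_tgt) <= tolerance * max(n_src, n_tgt)

    if not counts_comparable(1240, 1810):
        print("Warning: sentence counts differ by more than 30%; "
              "check the segmentation before aligning.")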
Best regards
Olivier
> Hello CORPORA List,
>
> I’d be interested to hear from colleagues who are evaluating/have
> evaluated automatic sentence alignment in parallel corpora. I’m
> especially interested in work with “distant” languages (e.g., Ar-En,
> Zh-En). I’m thinking some of the methods you have used would reflect
> Arcade 2 (Chiao, Kraif, et al., 2006), (Rosen, 2005) or (Singh &
> Husain 2005). However, it may be you’re only working in a specific
> domain, or at a small scale (comparing less than 5 aligners). I’d be
> curious to hear about your experiences, since I’ve been testing
> sentence aligners on text in a government/foreign affairs domain. I’d
> also welcome suggestions from anybody who has tried to incorporate
> usability testing into an evaluation of automatic sentence alignment:
> for example, you may have monitored how much manual correction users
> had to do after the alignment.
>
> Thanks,
>
> Eric Garbin
>
> Computational Linguist
>
> The Technology Development Group
>
> www.thetdgroup.com
>
> 571-262-2693
>
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora