[Corpora-List] Tool for raw parallel corpora alignment

Emmanuel Prochasson emmanuel.prochasson at univ-nantes.fr
Tue Mar 17 11:56:53 UTC 2009


Emmanuel Prochasson wrote:
> Thank you all for your quick and accurate answers; I'll have a look at
> all the tools you suggested.
>   

Here's a quick summary of my experiments with aligning raw parallel
corpora. First, you should know that my primary goal was to get a
ready-to-use processing pipeline for aligning short, raw parallel
corpora (in order to use the results in wider applications). I did not
spend much time compiling and running every project; I just took a
quick look at some of them. In other words: these comments reflect my
own view and needs.

I tried GIZA++ (before asking) but couldn't manage to use it
(compilation trouble, mostly). Though it's probably relevant software,
it's far from being usable "out of the box" (I don't blame anybody;
clearly my own research software is hard to use for researchers who
are not involved in it).

I tried NATools (which runs pretty well, encoding trouble apart), but it
requires an already sentence-aligned corpus.

I tried Xalign, but they don't seem to provide sources (and it requires
pre-processing of the corpora).

I tried to download MTTK. Though it's freely distributable software, the
download link is lost in oblivion (you have to register with a Google
Group that doesn't seem to be active anymore).

I tried uplug: this is dope, exactly what I was looking for. It runs
out of the box, uses many of the previously cited tools (such as
hunalign or GIZA++, already compiled), does not require any
installation and provides many easy-to-use scripts. Actually, in 3
documented steps I managed to align a short test parallel corpus
(extracted from Wikipedia) 10 minutes after unpacking the archive. It
seems to support common encodings (works fine with latin-1 and utf8)
and looks quite language-independent (I still have to investigate
that). I uploaded some samples (corpus and results) here:

http://eprochasson.free.fr/Corpora/

They look good enough for me (I didn't use any gold standard to evaluate
them, though). Different ways of aligning are supported (using more
complex or simpler methods, additional clues, taggers...); the
uploaded samples are the default alignment results.

http://sourceforge.net/projects/uplug

Hope this is helpful,

-- 
Emmanuel

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


