[Corpora-List] We present Yalign - A tool for extracting parallel sentences from comparable corpora.
Elías Andrawos
eandrawos at machinalis.com
Wed Sep 18 12:13:01 UTC 2013
Statistical Machine Translation relies on parallel corpora (e.g. Europarl)
for training translation models. However, these corpora are limited and take
time to create. Yalign is designed to automate this process by finding
sentences that are close translation matches from comparable corpora. This
opens up avenues for harvesting parallel corpora from sources like
translated documents and the web.
Project home: http://yalign.machinalis.com
Code: http://www.github.com/machinalis/yalign
Give Yalign a try and align some text extracted from Freebase:
http://yalign.machinalis.com/demo
Yalign is implemented using:
+ A sentence similarity metric. Given two sentences, it produces a rough
estimate (a number between 0 and 1) of how likely those two sentences are
to be translations of each other.
+ A sequence aligner, such that given two documents (each a list of
sentences) it produces an alignment which maximizes the sum of the
individual (per sentence pair) similarities.
So Yalign’s main algorithm is essentially a wrapper around a standard
sequence alignment algorithm.
For the sequence alignment, Yalign uses a variation of the Needleman-Wunsch
algorithm to find an optimal alignment between the sentences in two given
documents. On the good side, the algorithm has polynomial worst-case time
complexity and it produces an optimal alignment. On the bad side, it can’t
handle alignments that cross each other or alignments from two sentences
into a single one (even though it is possible to modify the current
implementation to handle those cases).
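
For reference, the classical dynamic programming form of this kind of
alignment can be sketched as below. This is an illustration only, not
Yalign's code: the names align and word_overlap, the gap_penalty parameter
and the toy word-overlap similarity are all assumptions made for the
example.

```python
def word_overlap(sentence_a, sentence_b):
    """Toy similarity in [0, 1] (Jaccard overlap of words); a stand-in only."""
    words_a, words_b = set(sentence_a.split()), set(sentence_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def align(doc_a, doc_b, similarity, gap_penalty=0.0):
    """Best monotone one-to-one alignment; returns (score, index pairs)."""
    n, m = len(doc_a), len(doc_b)
    # Evaluate the metric once per pair: N * M calls in total.
    sim = [[similarity(a, b) for b in doc_b] for a in doc_a]
    # best[i][j] = best score for aligning doc_a[:i] with doc_b[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + sim[i - 1][j - 1],  # pair the sentences
                best[i - 1][j] - gap_penalty,            # skip one in doc_a
                best[i][j - 1] - gap_penalty,            # skip one in doc_b
            )
    # Trace back from the end to recover which sentences were paired.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


score, pairs = align(["a b", "c d", "e f"], ["a b", "e f"], word_overlap)
```

Because every step moves forward in at least one document, the pairs can
never cross, and each sentence is used at most once, which matches the two
limitations mentioned above.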
Since computing the sentence similarity is an expensive operation, the
mentioned “variation” on the Needleman-Wunsch algorithm consists in using
A* search to explore the search space instead of the classical dynamic
programming approach (which would always require N * M calls to the
sentence similarity metric).
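
A minimal sketch of the idea, under stated assumptions: the alignment is
treated as a shortest-path search over a grid, where pairing two sentences
costs 1 - similarity and skipping a sentence costs a fixed gap_cost. This
cost model, the heuristic and the name astar_align are assumptions chosen
for the example and are not taken from Yalign's source. The point it shows
is that the metric is only evaluated for pairs the search actually reaches.

```python
import heapq


def astar_align(doc_a, doc_b, similarity, gap_cost=0.5):
    """Return (total cost, paired indices, number of similarity calls)."""
    n, m = len(doc_a), len(doc_b)
    cache = {}  # (i, j) -> match cost, filled lazily

    def match_cost(i, j):
        if (i, j) not in cache:
            cache[(i, j)] = 1.0 - similarity(doc_a[i], doc_b[j])
        return cache[(i, j)]

    def heuristic(i, j):
        # A length difference forces at least this many gaps, and match
        # costs are never negative, so this never overestimates.
        return gap_cost * abs((n - i) - (m - j))

    start, goal = (0, 0), (n, m)
    best_cost = {start: 0.0}
    parent = {}
    closed = set()
    frontier = [(heuristic(0, 0), start)]
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:
            break
        i, j = node
        steps = []
        if i < n and j < m:
            steps.append(((i + 1, j + 1), match_cost(i, j)))
        if i < n:
            steps.append(((i + 1, j), gap_cost))
        if j < m:
            steps.append(((i, j + 1), gap_cost))
        for nxt, cost in steps:
            new_cost = best_cost[node] + cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost + heuristic(*nxt), nxt))

    # Walk back from the goal; diagonal steps are the aligned pairs.
    pairs = []
    node = goal
    while node != start:
        prev = parent[node]
        if node == (prev[0] + 1, prev[1] + 1):
            pairs.append(prev)
        node = prev
    pairs.reverse()
    return best_cost[goal], pairs, len(cache)


doc = ["s0", "s1", "s2", "s3", "s4", "s5"]
exact = lambda a, b: 1.0 if a == b else 0.0  # toy metric for the demo
cost, pairs, evaluated = astar_align(doc, doc, exact)
```

In this easy case the search walks straight down the diagonal, so far
fewer than N * M pairs are scored. In the worst case it can still end up
evaluating every pair.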
After the alignment, only sentence pairs that have a high probability of
being translations are included in the final result. That is, the alignment
is filtered in order to deliver high quality output. To do this, a
threshold value is used: if the sentence similarity metric for a pair falls
below the threshold, that pair is excluded.
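
The filtering step can be sketched in a few lines. The function name, the
(sentence, sentence, similarity) tuple layout and the default threshold of
0.5 are assumptions for illustration, not Yalign's actual interface or
settings.

```python
def filter_alignment(scored_pairs, threshold=0.5):
    """Keep only the pairs whose similarity reaches the threshold."""
    return [
        (source, target, score)
        for source, target, score in scored_pairs
        if score >= threshold
    ]


aligned = [
    ("Hello world", "Hola mundo", 0.93),
    ("See you soon", "El gato duerme", 0.12),
]
kept = filter_alignment(aligned, threshold=0.5)
```

Raising the threshold trades recall for precision: fewer pairs survive,
but the ones that do are more likely to be real translations.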
For the sentence similarity metric the algorithm uses a statistical
classifier’s likelihood output and adapts it into the 0-1 range.
The classifier is trained to determine if a pair of sentences are
translations of each other or not (a binary value). The particular
classifier used for this project is a Support Vector Machine. Besides being
excellent classifiers, SVMs can provide a distance to the separating
hyperplane during classification, and this distance can be easily mapped
through a sigmoid function to return a likelihood between 0 and 1.
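
As a rough sketch of that mapping: the function below squashes a signed
distance to the hyperplane into the 0-1 range with a logistic sigmoid. The
name distance_to_likelihood and the slope parameter are assumptions for the
example; how Yalign scales the distance is not stated here. With an SVM
library, the distance would typically come from something like
scikit-learn's SVC.decision_function, if that library is used.

```python
import math


def distance_to_likelihood(distance, slope=1.0):
    """Map a signed hyperplane distance to a value in [0, 1]."""
    x = slope * distance
    # Two branches keep math.exp from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


on_boundary = distance_to_likelihood(0.0)
likely = distance_to_likelihood(2.0)
unlikely = distance_to_likelihood(-2.0)
```

A pair sitting exactly on the hyperplane maps to 0.5, pairs far on the
"translation" side approach 1, and pairs far on the other side approach 0.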
The use of a classifier means that the quality of the alignment is
dependent not only on the input but also on the quality of the trained
classifier.
Join our mailing list: http://groups.google.com/group/yalign
See you soon!
floss [at] machinalis [dot] com
--
Elías Andrawos
Machinalis - http://machinalis.com/
+54 (0351) 152-750975
+54 (0351) 4315739
Skype: eandrawos
Gtalk: eandrawos at machinalis.com