[Corpora-List] Problem with Microsoft Bilingual Aligner

Mon Jun 25 18:27:30 UTC 2012

Hi all,

While building a parallel Urdu-English corpus with the Microsoft Bilingual
Aligner, I ran into some problems.
I tried to extract parallel sentences from about 900 document pairs, in
which roughly 60% of the sentences are parallel and should therefore be
output as alignments. But with the Microsoft aligner I got just a few lines
of output (almost all of the output files are empty).
I then tried BleuAlign and got different results: almost all of the
parallel sentences were extracted and output as aligned sentences. This
suggests that the Microsoft aligner doesn't work on my files.
On the other hand, I tried a ready-to-use English-Urdu corpus that was
manually translated for English-Urdu machine translation (MT). This
corpus is available here:
http://www.crulp.org/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm

Running the Microsoft aligner on this corpus, I got surprising results: all
the sentences were output as aligned sentences.
I also checked the encoding of my files; it is UTF-8.
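
For anyone who wants to repeat the encoding check, here is a minimal Python
sketch. It is an illustration only: the function name inspect_text_file and
the file names in the usage note are hypothetical, and it looks only at
UTF-8 validity, a leading byte-order mark, Windows line endings, and the
number of non-empty lines. Whether any of these actually affects the
Microsoft aligner is an assumption I have not verified.

```python
import codecs
from pathlib import Path


def inspect_text_file(path):
    """Report basic encoding facts about one text file (hypothetical helper)."""
    raw = Path(path).read_bytes()

    # A UTF-8 byte-order mark is legal but may confuse some tools.
    has_bom = raw.startswith(codecs.BOM_UTF8)

    # Strict decoding raises UnicodeDecodeError on any invalid byte sequence.
    try:
        text = raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        text = ""
        valid_utf8 = False

    # Count lines that contain something other than whitespace.
    non_empty_lines = sum(1 for line in text.splitlines() if line.strip())

    return {
        "valid_utf8": valid_utf8,
        "has_bom": has_bom,
        "crlf": b"\r\n" in raw,
        "non_empty_lines": non_empty_lines,
    }
```

Calling inspect_text_file("doc001.ur.txt") and
inspect_text_file("doc001.en.txt") (hypothetical file names) on one of my
document pairs and on a pair from the CRULP corpus, then comparing the
reports, may show whether my inputs differ in form from the corpus that
aligned well.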

So what could be the reason for the bad results on my documents: the
encoding of the files, the Microsoft aligner itself, or something else?

Any suggestions?