Hi all,<div><br></div><div>While building a parallel Urdu-English corpus with Microsoft aligner, I ran into a problem:</div><div>I tried to extract parallel sentences from about 900 document pairs, in which almost 60% of the sentences are parallel and should be reported as alignments. But with Microsoft aligner I got just a few lines of output (almost all of the output files are empty).</div>
<div>I then tried BleuAlign and got different results: almost all of the parallel sentences were extracted and reported as aligned. This suggests that Microsoft aligner doesn't work on my files.</div><div>On the other hand, I tried a ready-to-use English-Urdu corpus that was manually translated for use in English-Urdu machine translation (MT) systems. The corpus is available here: <a href="http://www.crulp.org/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm">http://www.crulp.org/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm</a></div>
<div><br></div><div>Running Microsoft aligner on this corpus, I got surprising results: all of the sentences were reported as aligned.</div><div>I also checked the encoding of the files; it is UTF-8.</div><div><br></div><div>
So what could be the reason for the bad results on my documents? The encoding of the files, Microsoft aligner itself, or something else?</div><div><br></div><div>Any suggestions?</div>