[Corpora-List] Manually annotated alignments

Chris Callison-Burch callison-burch at ed.ac.uk
Fri May 18 15:13:50 UTC 2007


Dear Lexi,

The LDC has word alignments for Arabic-English and Chinese-English as  
part of its releases for the first year of the GALE program.  Check  
out catalog numbers LDC2006G09 and LDC2006E86.

"The Arabic text was selected from Arabic Treebank Part 3, all the  
files are from An Nahar. There are 58 files, 1,183 sentences, and 30K  
Arabic words, in this release.

The Chinese source text was selected from Chinese Treebank, all the  
files are from Xinhua News Agency. There are 159 files, 1,882  
sentences, and 49K Chinese words (according to Chinese Treebank word  
segmentation), in this release."

--Chris

On May 17, 2007, at 7:52 PM, Alexandra Birch wrote:

> Hi there,
>
> I am searching for manually annotated word/phrase alignments from
> parallel corpora. So far I have discovered:
>
> ACL2003 shared task
> http://www.cs.unt.edu/~rada/wpt/
> Romanian - English (Mihalcea & Pedersen 2003)
> English - French (Och & Ney 2000)
>
> ACL2005 shared task
> http://www.cse.unt.edu/~rada/wpt05/
> English - Inuktitut
> English - Hindi
>
> EPPS Word Alignment Trial and Test Set
> Spanish - English (500 sentences)
> http://gps-tsc.upc.es/veu/LR/epps_ensp_alignref.php3
>
> I will keep looking, but I would appreciate it if anyone could
> inform me of other resources they know about.
>
> Thank you
>
> Lexi
>


