[Corpora-List] Manually annotated alignments
Chris Callison-Burch
callison-burch at ed.ac.uk
Fri May 18 15:13:50 UTC 2007
Dear Lexi,

The LDC has word alignments for Arabic-English and Chinese-English as
part of its releases for the first year of the GALE program. Check
out catalog numbers LDC2006G09 and LDC2006E86.

"The Arabic text was selected from Arabic Treebank Part 3, all the
files are from An Nahar. There are 58 files, 1,183 sentences, and 30K
Arabic words, in this release.

The Chinese source text was selected from Chinese Treebank, all the
files are from Xinhua News Agency. There are 159 files, 1,882
sentences, and 49K Chinese words (according to Chinese Treebank word
segmentation), in this release."

--Chris

On May 17, 2007, at 7:52 PM, Alexandra Birch wrote:
> Hi there,
>
> I am searching for manually annotated word/phrase alignments from
> parallel corpora. So far I have discovered:
>
> ACL2003 shared task
> http://www.cs.unt.edu/~rada/wpt/
> Romanian - English (Mihalcea & Pedersen 2003)
> English - French (Och & Ney 2000)
>
> ACL2005 shared task
> http://www.cse.unt.edu/~rada/wpt05/
> English - Inuktitut
> English - Hindi
>
> EPPS Word Alignment Trial and Test Set
> Spanish - English (500 sentences)
> http://gps-tsc.upc.es/veu/LR/epps_ensp_alignref.php3
>
> I will keep looking but I would appreciate it if anyone could
> inform me of other resources they know about.
>
> Thank you
>
> Lexi