Arabic-L:LING:GALE Arabic-English Word Alignment Training Part 1

Dilworth Parkinson dilworthparkinson at GMAIL.COM
Mon Mar 17 22:42:12 UTC 2014


------------------------------------------------------------------------
Arabic-L: Mon 17 Mar 2014
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: GALE Arabic-English Word Alignment Training Part 1

-------------------------Messages-----------------------------------
1)
Date: 17 Mar 2014
From: Linguistic Data Consortium <ldc at ldc.upenn.edu>
Subject: GALE Arabic-English Word Alignment Training Part 1

(1) GALE Arabic-English Word Alignment Training Part 1 -- Newswire and
Web <http://catalog.ldc.upenn.edu/LDC2014T05> was developed by LDC and
contains 344,680 tokens of word-aligned Arabic and English parallel text
enriched with linguistic tags. This material was used as training data in
the DARPA GALE (Global Autonomous Language Exploitation) program
<https://www.ldc.upenn.edu/collaborations/past-projects>.

Some approaches to statistical machine translation incorporate linguistic
knowledge into word-aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two
annotation schemes: alignment and tagging. Alignment identifies minimum
translation units and translation relations by using minimum-match and
attachment annotation approaches. The tagging scheme defines a set of word
tags and alignment link tags that describe these translation units and
relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation.

This release consists of Arabic source newswire and web data collected by
LDC in 2006 - 2008. The distribution by genre, words, character tokens and
segments appears below:

Language  Genre  Docs  Words    CharTokens  Segments
Arabic    WB     119   59,696   81,620      4,383
Arabic    NW     717   198,621  263,060     8,423

Note that word count is based on the untokenized Arabic source, and token
count is based on the tokenized Arabic source.
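
As a quick consistency check, the character-token counts for the two
genres add up to the 344,680 tokens quoted in the corpus description. A
minimal sketch in Python, with the figures copied from the table; the
dictionary layout and the names used here are illustrative only and are
not part of the LDC release:

```python
# Per-genre figures copied from the announcement's table.
# WB and NW are taken to mean web and newswire, matching the corpus title.
counts = {
    "WB": {"docs": 119, "words": 59_696, "char_tokens": 81_620, "segments": 4_383},
    "NW": {"docs": 717, "words": 198_621, "char_tokens": 263_060, "segments": 8_423},
}


def totals(table):
    """Sum each column across all genres."""
    result = {}
    for row in table.values():
        for column, value in row.items():
            result[column] = result.get(column, 0) + value
    return result


summary = totals(counts)
print(summary)
```

The char_tokens total is the figure that should match the 344,680 tokens
given in the description; the other totals are not stated in the
announcement itself.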

The Arabic word alignment tasks consisted of the following components:


   - Normalizing tokenized tokens as needed
   - Identifying different types of links
   - Identifying sentence segments not suitable for annotation
   - Tagging unmatched words attached to other words or phrases


GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web is
distributed via web download.

2014 Subscription Members will automatically receive two copies of this
data on disc. 2014 Standard Members may request a copy as part of their 16
free membership corpora.  Non-members may license this data for US$1750.


--------------------------------------------------------------------------
End of Arabic-L: 17 Mar 2014