Arabic-L:LING:GALE Word Alignment Broadcast Training Part 1

Fri Sep 19 04:52:30 UTC 2014

------------------------------------------------------------------------
Arabic-L: Fri 19 Sep 2014
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: GALE Word Alignment Broadcast Training Part 1

-------------------------Messages-----------------------------------
1)
Date: 19 Sep 2014
From: reposted from LDC
Subject: GALE Word Alignment Broadcast Training Part 1

GALE Arabic-English Word Alignment -- Broadcast Training Part 1
<https://catalog.ldc.upenn.edu/LDC2014T19> was developed by LDC and
contains 267,257 tokens of word-aligned Arabic and English parallel text
enriched with linguistic tags. This material was used as training data in
the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic
knowledge into word-aligned text as a means to improve automatic word
alignment and machine translation quality. In this corpus, that is
accomplished with two annotation schemes: alignment and tagging. Alignment
identifies minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging scheme
defines a set of word tags and alignment link tags to describe these
translation units and relations. Tagging adds contextual, syntactic and
language-specific features to the alignment annotation.
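
As a rough illustration of what a tagged alignment looks like as data, the
sketch below models links between token positions, each carrying a link
tag. Everything in it is an assumption made for illustration: the `Link`
class, the `unaligned` helper, the tag names "semantic" and "function", and
the example sentence pair are hypothetical and do not reproduce the file
format or tag inventory of the LDC release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One alignment link (hypothetical structure, not the LDC format)."""
    src: tuple  # indices into the Arabic token list
    tgt: tuple  # indices into the English token list
    tag: str    # illustrative link tag, not the corpus's real tag set


def unaligned(n_tokens, links, side):
    """Return token indices on the given side ('src' or 'tgt') with no link."""
    covered = {i for link in links for i in getattr(link, side)}
    return [i for i in range(n_tokens) if i not in covered]


# Invented example: tokenized Arabic (transliterated) and an English gloss.
arabic = ["w+", "qAl", "Al+", "r}ys"]
english = ["and", "the", "president", "said"]

# Minimum-match style: each link covers the smallest unit that translates.
links = [
    Link((0,), (0,), "function"),
    Link((1,), (3,), "semantic"),
    Link((2,), (1,), "function"),
    Link((3,), (2,), "semantic"),
]

print(unaligned(len(arabic), links, "src"))
print(unaligned(len(english), links, "tgt"))
```

Tokens reported by `unaligned` are the ones an annotator would either
attach to a neighboring unit or mark as not translated.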

This release consists of Arabic source broadcast news (BN) and broadcast
conversation (BC) data collected by LDC between 2007 and 2009. The
distribution by genre, words, tokens and segments appears below:

Language   Genre   Files     Words    Tokens   Segments
Arabic     BC        231    79,485   103,816      4,114
Arabic     BN         92   131,789   163,441      7,227
Totals               323   211,274   267,257     11,341

Note that word count is based on the untokenized Arabic source, and token
count is based on the tokenized Arabic source.
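
Because tokenization splits some words into several tokens, the token
counts exceed the word counts. The short sketch below recomputes the totals
and the tokens-per-word ratio from the figures in this announcement; the
dictionary layout and function names are illustrative choices, not part of
the release.

```python
# Figures copied from the distribution table in this announcement.
counts = {
    "BC": {"files": 231, "words": 79_485, "tokens": 103_816, "segments": 4_114},
    "BN": {"files": 92, "words": 131_789, "tokens": 163_441, "segments": 7_227},
}


def totals(rows):
    """Sum each column across genres."""
    keys = ("files", "words", "tokens", "segments")
    return {k: sum(r[k] for r in rows.values()) for k in keys}


def tokens_per_word(row):
    """Average number of tokens produced per untokenized word."""
    return row["tokens"] / row["words"]


total = totals(counts)
for genre, row in counts.items():
    print(f"{genre}: {tokens_per_word(row):.3f} tokens per word")
print(f"All: {tokens_per_word(total):.3f} tokens per word")
```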

The Arabic word alignment tasks consisted of the following components:

  * Normalizing tokens as needed
  * Identifying different types of links
  * Identifying sentence segments not suitable for annotation
  * Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment -- Broadcast Training Part 1 is
distributed via web download.

2014 Subscription Members will automatically receive two copies of this
data on disc.  2014 Standard Members may request a copy as part of their 16
free membership corpora.  Non-members may license this data for US$1750.

--------------------------------------------------------------------------
End of Arabic-L: 19 Sep 2014