Arabic-L:LING:GALE Arabic-English Word Alignment Training Part 3-Web

Dilworth Parkinson dilworthparkinson at GMAIL.COM
Fri Jul 18 21:24:31 UTC 2014


------------------------------------------------------------------------
Arabic-L: Fri 18 Jul 2014
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: GALE Arabic-English Word Alignment Training Part 3-Web

-------------------------Messages-----------------------------------
1)
Date: 18 Jul 2014
From: reposted from LDC <ldc at ldc.upenn.edu>
Subject: GALE Arabic-English Word Alignment Training Part 3-Web

GALE Arabic-English Word Alignment Training Part 3 -- Web
<https://catalog.ldc.upenn.edu/LDC2014T14> was developed by LDC and
contains 217,158 tokens of word-aligned Arabic and English parallel text
enriched with linguistic tags. This material was used as training data in
the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate
linguistic knowledge into word-aligned text as a means to improve
automatic word alignment and machine translation quality. This is
accomplished with two annotation schemes: alignment and tagging. Alignment
identifies minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging scheme
defines a set of word tags and alignment link tags to describe these
translation units and relations. Tagging adds contextual, syntactic and
language-specific features to the alignment annotation.
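
To make the idea of tagged alignment links concrete, here is a minimal
sketch of how one might hold such annotation in memory. It is not the LDC
file format or tag inventory: the class name, the field names, the tag
labels ("EXAMPLE_LINK", "EXAMPLE_ATTACHED") and the toy tokens are all
assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AlignmentLink:
    """One translation unit: source token positions linked to target positions."""
    source_idx: Tuple[int, ...]   # positions in the tokenized source segment
    target_idx: Tuple[int, ...]   # positions in the target segment
    link_tag: str                 # hypothetical label for the translation relation
    # Hypothetical per-word tags, keyed by target position (e.g. an attached word).
    word_tags: Dict[int, str] = field(default_factory=dict)


def uncovered_targets(n_target: int, links: List[AlignmentLink]) -> List[int]:
    """Return target positions that no link covers."""
    covered = {i for link in links for i in link.target_idx}
    return [i for i in range(n_target) if i not in covered]


# Toy segment pair (invented, transliterated); not taken from the corpus.
source_tokens = ["w", "qAl"]
target_tokens = ["and", "he", "said"]

links = [
    AlignmentLink(source_idx=(0,), target_idx=(0,), link_tag="EXAMPLE_LINK"),
    # "he" has no source counterpart here, so it is attached to "said"
    # and marked with a (hypothetical) word tag.
    AlignmentLink(
        source_idx=(1,),
        target_idx=(1, 2),
        link_tag="EXAMPLE_LINK",
        word_tags={1: "EXAMPLE_ATTACHED"},
    ),
]

print(uncovered_targets(len(target_tokens), links))
```

Keeping the link tag on the link and the word tags on individual positions
mirrors the split between alignment link tags and word tags described
above; the actual tag sets are defined in the corpus documentation.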

Other releases available in this series are:

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire
and Web (LDC2012T16 <http://catalog.ldc.upenn.edu/LDC2012T16>)

GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire
(LDC2012T20 <http://catalog.ldc.upenn.edu/LDC2012T20>)

GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
(LDC2012T24 <http://catalog.ldc.upenn.edu/LDC2012T24>)

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
(LDC2013T05 <http://catalog.ldc.upenn.edu/LDC2013T05>)

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training
Part 1 (LDC2013T23 <http://catalog.ldc.upenn.edu/LDC2013T23>)

GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web
(LDC2014T05 <http://catalog.ldc.upenn.edu/LDC2014T05>)

GALE Arabic-English Word Alignment Training Part 2 -- Newswire
(LDC2014T10 <http://catalog.ldc.upenn.edu/LDC2014T10>)

This release consists of Arabic source web data collected by LDC. The
distribution by genre, words, character tokens and segments appears below:

Language   Genre   Files   Words     CharTokens   Segments
Arabic     WB      2,449   154,144   217,158      7,332

Note that word count is based on the untokenized Arabic source, and token
count is based on the tokenized Arabic source.
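
As a quick sanity check on how these counts relate, the ratios between
them can be computed directly. The sketch below uses only the figures
reported in this announcement; the variable names are illustrative.

```python
# Counts reported in this announcement for the Arabic web (WB) data.
files = 2_449
words = 154_144         # counted on the untokenized Arabic source
char_tokens = 217_158   # counted on the tokenized Arabic source
segments = 7_332

tokens_per_word = char_tokens / words
words_per_segment = words / segments
segments_per_file = segments / files

print(f"tokens per word:   {tokens_per_word:.2f}")
print(f"words per segment: {words_per_segment:.2f}")
print(f"segments per file: {segments_per_file:.2f}")
```

This works out to roughly 1.4 tokens per untokenized word, about 21 words
per segment, and about 3 segments per file.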

The Arabic word alignment tasks consisted of the following components:

Normalizing tokenized tokens as needed

Identifying different types of links

Identifying sentence segments not suitable for annotation

Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment Training Part 3 -- Web is distributed
via web download.

2014 Subscription Members will automatically receive two copies of this
data on disc.  2014 Standard Members may request a copy as part of their 16
free membership corpora.  Non-members may license this data for US$1750.


--------------------------------------------------------------------------
End of Arabic-L: 18 Jul 2014