Arabic-L:LING:GALE Phase 2 from LDC

Dilworth Parkinson dilworthparkinson at GMAIL.COM
Sat Aug 25 07:02:57 UTC 2012


------------------------------------------------------------------------
Arabic-L: Sat 25 Aug 2012
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:GALE Phase 2 from LDC

-------------------------Messages-----------------------------------
1)
Date: 25 Aug 2012
From:Linguistic Data Consortium ldc at ldc.upenn.edu
Subject:GALE Phase 2 from LDC

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
was developed by LDC. Along with other corpora, the parallel text in
this release comprised training data for Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Modern Standard Arabic source text and corresponding English
translations selected from broadcast conversation (BC) data collected
by LDC between 2004 and 2007 and transcribed by LDC or under its
direction.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
includes 29 source-translation document pairs, comprising 169,488
words of Arabic source text and its English translation. Data is drawn
from eight distinct Arabic programs broadcast between 2004 and 2007
from Aljazeera, a regional broadcast programmer based in Doha, Qatar;
and Nile TV, an Egyptian broadcaster. The programs in this release
focus on current events topics.

The files in this release were transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with the
Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the text.
Data was manually selected for translation according to several
criteria, including linguistic features, transcription features and
topic features. The transcribed and segmented files were then
reformatted into a human-readable translation format and assigned to
translation vendors. Translators followed LDC's Arabic-to-English
translation guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations.

--------------------------------------------------------------------------
End of Arabic-L: 25 Aug 2012


