Arabic-L:LING:LDC's GALE Phase 2
Dilworth Parkinson
dilworthparkinson at GMAIL.COM
Tue May 29 19:24:15 UTC 2012
------------------------------------------------------------------------
Arabic-L: Tue 29 May 2012
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:LDC's GALE Phase 2
-------------------------Messages-----------------------------------
1)
Date: 29 May 2012
From:Linguistic Data Consortium ldc at ldc.upenn.edu (reposted from their
newsletter)
Subject:LDC's GALE Phase 2
(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
was developed by LDC. Along with other corpora, the parallel text in
this release comprised machine translation training data for Phase 2
of the DARPA GALE (Global Autonomous Language Exploitation) Program.
This corpus contains Modern Standard Arabic source text and
corresponding English translations selected from broadcast
conversation (BC) data collected by LDC between 2004 and 2007 and
transcribed by LDC or under its direction.
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
includes 36 source-translation document pairs, comprising 169,109
words of Arabic source text and its English translation. Data is drawn
from thirteen distinct Arabic programs broadcast between 2004 and 2007
from the following sources: Al Alam News Channel, Aljazeera, Dubai TV,
Oman TV, and Radio Sawa. Broadcast conversation programming is
generally more interactive than traditional news broadcasts and
includes talk shows, interviews, call-in programs and roundtable
discussions. The programs in this release focus on current events
topics.
The files in this release were transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with Quick
Rich Transcription guidelines developed by LDC. Transcribers indicated
sentence boundaries in addition to transcribing the text. Data was
manually selected for translation according to several criteria,
including linguistic features, transcription features and topic
features. The transcribed and segmented files were then reformatted
into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Arabic-to-English translation
guidelines, which are included with this release. Bilingual LDC staff
performed quality control procedures on the completed translations.
Source data and translations are distributed in TDF format. All data
are encoded in UTF-8.
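For readers who want to inspect such files programmatically, the
following is a minimal sketch only. It assumes that TDF files are
plain tab-delimited text, one record per line; the function name, the
sample rows and the file path are hypothetical, and the actual column
layout is not described in this announcement.

```python
import csv
import io


def read_tab_delimited(stream):
    """Return the non-empty rows of a tab-delimited text stream as lists of fields."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Hypothetical two-column sample; the real files may have a different layout.
sample = io.StringIO("segment-1\tمرحبا\nsegment-2\tHello\n")
rows = read_tab_delimited(sample)

# For a file on disk (hypothetical path), open it explicitly as UTF-8:
# with open("example.tdf", encoding="utf-8", newline="") as f:
#     rows = read_tab_delimited(f)
```

Opening the file with an explicit encoding avoids depending on the
platform default, which matters for the Arabic source text.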
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 is
distributed via web download.
2012 Subscription Members will automatically receive one copy of this
data on disc. 2012 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data
for US$1750.
--------------------------------------------------------------------------
End of Arabic-L: 29 May 2012