Arabic-L:LING:GALE Phase 2 Arabic Newswire Parallel Text
Dilworth Parkinson
dilworthparkinson at GMAIL.COM
Fri Nov 16 14:48:40 UTC 2012
------------------------------------------------------------------------
Arabic-L: Fri 16 Nov 2012
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:GALE Phase 2 Arabic Newswire Parallel Text
-------------------------Messages-----------------------------------
1)
Date: 16 Nov 2012
From:Linguistic Data Consortium ldc at ldc.upenn.edu
Subject:GALE Phase 2 Arabic Newswire Parallel Text
GALE Phase 2 Arabic Newswire Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T17>
was developed by LDC. Along with other corpora, the parallel text in this
release comprised training data for Phase 2 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. This corpus contains Modern
Standard Arabic source text and corresponding English translations selected
from newswire data collected in 2007 by LDC and transcribed by LDC or under
its direction.

GALE Phase 2 Arabic Newswire Parallel Text includes 400 source-translation
pairs, comprising 181,704 tokens of Arabic source text and its English
translation. Data is drawn from six distinct Arabic newswire sources: Al
Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with the Quick
Rich Transcription
<http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V3.pdf>
guidelines developed by LDC. Transcribers indicated sentence boundaries in
addition to transcribing the text. Data was manually selected for
translation according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented
files were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's Arabic to
English translation guidelines. Bilingual LDC staff performed quality
control procedures on the completed translations.

GALE Phase 2 Arabic Newswire Parallel Text is distributed via web download.

2012 Subscription Members will automatically receive two copies of this
data on disc. 2012 Standard Members may request a copy as part of their 16
free membership corpora. Non-members may license this data for US$1750.
--------------------------------------------------------------------------
End of Arabic-L: 16 Nov 2012