Arabic-L:LING:New LDC resources

Dilworth Parkinson dilworth_parkinson at BYU.EDU
Fri Jul 27 19:21:13 UTC 2007


------------------------------------------------------------------------
Arabic-L: Fri 27 Jul 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: New LDC resources

-------------------------Messages-----------------------------------
1)
Date: 27 Jul 2007
From: ldc at ldc.upenn.edu
Subject: New LDC resources


(2)  GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 is the  
first part of the three-part GALE Phase 1 Arabic Broadcast News  
Parallel Text, which, along with other corpora, was used as training  
data in year 1 (Phase 1) of the DARPA-funded GALE program. This  
corpus contains transcripts and English translations of 17 hours of  
Arabic broadcast news programming selected from a variety of  
sources.  A manual selection procedure was used to choose data  
appropriate for the GALE program, namely, news and conversation  
programs focusing on current events. Stories on topics such as  
sports, entertainment news, and stock market reports were excluded  
from the data set.

The selected audio snippets were then carefully transcribed by LDC  
annotators and professional transcription agencies following LDC's  
Quick Rich Transcription specification. Manual sentence unit/segment
(SU) annotation was also performed as part of the transcription task.
Three types of end-of-sentence SU are identified:

    statement SU
    question SU
    incomplete SU

After transcription and SU annotation, the files were reformatted
into a human-readable translation format and were then assigned to  
professional translators for careful translation. Translators  
followed LDC's GALE translation guidelines, which describe the makeup  
of the translation team, the source data format, the translation data  
format, best practices for translating certain linguistic features  
(such as names and speech disfluencies), and quality control  
procedures applied to completed translations.

All final data are in Tab Delimited Format (TDF). TDF is compatible  
with other transcription formats, such as the Transcriber format and  
AG format, and it is easy to process.  Each line of a TDF file  
corresponds to a speech segment and contains 13 tab-delimited
fields.  The source TDF file and its translation are the same except  
that the transcript in the source TDF is replaced by its English  
translation.  GALE Phase 1 Arabic Broadcast News Parallel Text - Part  
1 is distributed via web download.
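
As a rough illustration of the TDF layout described above, the following
Python sketch splits each line into its 13 tab-delimited fields and pairs
every source transcript with its English translation. The announcement
does not say which of the 13 columns holds the transcript, so the caller
supplies that position through a hypothetical transcript_index argument.
Skipping blank lines and lines that begin with ";;" is likewise an
assumption about the files, not something stated here. The function names
are illustrative only.

```python
from typing import Iterable, List, Tuple

EXPECTED_FIELDS = 13  # field count stated in the announcement


def parse_tdf_lines(lines: Iterable[str]) -> List[List[str]]:
    """Split TDF lines into 13-field rows, one row per speech segment."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Assumption: blank lines and ';;' comment lines hold no segment data.
        if not line.strip() or line.startswith(";;"):
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {number}: expected {EXPECTED_FIELDS} fields, "
                f"got {len(fields)}"
            )
        rows.append(fields)
    return rows


def pair_segments(
    source_rows: List[List[str]],
    translation_rows: List[List[str]],
    transcript_index: int,  # hypothetical: column position is not given here
) -> List[Tuple[str, str]]:
    """Pair each source transcript with its translation.

    Relies on the stated property that the two files are the same except
    for the transcript field, and raises if any other field differs.
    """
    if len(source_rows) != len(translation_rows):
        raise ValueError("source and translation have different row counts")
    pairs = []
    for row_number, (src, tgt) in enumerate(
        zip(source_rows, translation_rows), start=1
    ):
        for col in range(EXPECTED_FIELDS):
            if col != transcript_index and src[col] != tgt[col]:
                raise ValueError(
                    f"row {row_number}: field {col} differs between files"
                )
        pairs.append((src[transcript_index], tgt[transcript_index]))
    return pairs
```

A caller would read both files with a Unicode-aware encoding, pass their
lines to parse_tdf_lines, and then call pair_segments with the transcript
column position taken from the corpus documentation. The sketch does not
treat a header row specially; if the files carry one, it would need to be
dropped before pairing.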

2007 Subscription Members will automatically receive two copies of  
this corpus. 2007 Standard Members may request a copy as part of  
their 16 free membership corpora. Nonmembers may license this data  
for US$1500.

------------------------------------------------------------------------
End of Arabic-L:  27 Jul 2007