Arabic-L:LING:New LDC materials

Dilworth Parkinson dilworth_parkinson at BYU.EDU
Fri Jun 27 21:56:58 UTC 2008


------------------------------------------------------------------------
Arabic-L: Fri 27 Jun 2008
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:New LDC materials

-------------------------Messages-----------------------------------
1)
Date: 27 Jun 2008
From:reposted from CORPORA
Subject:New LDC materials
New Publications

(1) GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 is the  
second part of the three-part GALE Phase 1 Arabic Broadcast News  
Parallel Text, which, along with other corpora, was used as training  
data in year 1 (Phase 1) of the DARPA-funded GALE program. The corpus  
contains transcripts and English translations of 10.7 hours of Arabic  
broadcast news programming selected from various sources. This corpus  
does not contain the audio files from which the transcripts and  
translations were generated.

The Arabic broadcast news recordings were selected from four sources  
and four different programs.   A manual selection procedure was used  
to choose data appropriate for the GALE program, namely, news and  
conversation programs focusing on current events. Stories on topics  
such as sports, entertainment news, and stock market reports were  
excluded from the data set. Manual sentence unit/segment (SU)
annotation was also performed on a subset of files following LDC's
Quick Rich Transcription specification. Three types of end-of-sentence
SU were identified: statement SU, question SU, and incomplete SU.

After transcription and SU annotation, the files were reformatted into
a human-readable translation format and were then assigned to
professional translators for careful translation. Translators followed  
LDC's GALE Translation guidelines, which describe the makeup of the
translation team, the source data format, the translation data
format, best practices for translating certain linguistic features
(such as names and speech disfluencies), and quality control  
procedures applied to completed translations.

Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu

--------------------------------------------------------------------------
End of Arabic-L:  27 Jun 2008


