Arabic-L:LING:New LDC materials
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Fri Jun 27 21:56:58 UTC 2008
------------------------------------------------------------------------
Arabic-L: Fri 27 Jun 2008
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:New LDC materials
-------------------------Messages-----------------------------------
1)
Date: 27 Jun 2008
From:reposted from CORPORA
Subject:New LDC materials
New Publications
(1) GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 is the
second part of the three-part GALE Phase 1 Arabic Broadcast News
Parallel Text, which, along with other corpora, was used as training
data in year 1 (Phase 1) of the DARPA-funded GALE program. The corpus
contains transcripts and English translations of 10.7 hours of Arabic
broadcast news programming selected from various sources. This corpus
does not contain the audio files from which the transcripts and
translations were generated.
The Arabic broadcast news recordings were selected from four sources
and four different programs. A manual selection procedure was used
to choose data appropriate for the GALE program, namely, news and
conversation programs focusing on current events. Stories on topics
such as sports, entertainment news, and stock market reports were
excluded from the data set. Manual sentence units/segments (SU)
annotation was also performed on a subset of files following LDC's
Quick Rich Transcription specification. Three types of end-of-sentence
SU were identified: statement SU, question SU, and incomplete SU.
After transcription and SU annotation, the files were reformatted into
a human-readable translation format and then assigned to
professional translators for careful translation. Translators followed
LDC's GALE Translation guidelines, which describe the makeup of the
translation team, the source data format, the translation data
format, best practices for translating certain linguistic features
(such as names and speech disfluencies), and quality control
procedures applied to completed translations.
Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104 USA
Phone: (215) 573-1275
Fax: (215) 573-2175
ldc at ldc.upenn.edu
http://www.ldc.upenn.edu
--------------------------------------------------------------------------
End of Arabic-L: 27 Jun 2008