Arabic-L:LING:New Levantine Arabic Speech Corpus from LDC
Dilworth Parkinson
dilworth_parkinson at byu.edu
Mon Feb 14 20:47:03 UTC 2005
------------------------------------------------------------------------
Arabic-L: Mon 14 Feb 2005
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:New Levantine Arabic Speech Corpus from LDC
-------------------------Messages-----------------------------------
1)
Date: 14 Feb 2005
From:reposted from Corpora
Subject:New Levantine Arabic Speech Corpus from LDC
LDC2005S08
  BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC2005T01
  Chinese Treebank 5.0
LDC2005S07
  Levantine Arabic QT Training Data Set 3 Speech
LDC2005T03
  Levantine Arabic QT Training Data Set 3 Transcripts
The Linguistic Data Consortium (LDC) would like to announce the
availability of four new corpora.
(1) BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
consists of transcribed, spontaneous speech, recorded from subjects
speaking in Levantine colloquial Arabic. Levantine Arabic is the
dialect of Arabic spoken by ordinary people in Lebanon, Jordan, Syria,
and Palestine. It differs significantly from Modern Standard Arabic
(MSA) in that it is a spoken rather than a written language, with
different word pronunciations and even different words.
The corpus should be useful to anyone working on speech recognition in
Levantine colloquial Arabic, including for speech translation and
spoken dialog systems. BBN/AUB DARPA Babylon Levantine Arabic Speech
and Transcripts is distributed on two DVD-ROMs.
(2) Chinese Treebank 5.0 is a 500K word corpus of Chinese text with
syntactic bracketing. The corpus contains 824K Hanzi, 18K sentences,
and 890 data files. The data is drawn from three sources: Xinhua
(1994-1998), Information Services Department of HKSAR (1997), and
Sinorama magazine, Taiwan (1996-1998 & 2000-2001).
All files are GB encoded. Chinese Treebank 5.0 provides four versions
of files: bracketed, raw, segmented and POS tagged. The raw, segmented
and POS tagged versions are generated from the bracketed version and so
do not reflect the previous annotation stages. Chinese Treebank 5.0 is
distributed on one CD-ROM.
(3) Levantine Arabic QT Training Data Set 3 Speech contains 322
telephone conversations and totals about 50 hours of Levantine Arabic
speech. Participants were instructed to speak on set topics. Unlike
the previous training data corpora (Sets 1 and 2), whose speakers are
nearly 100% Jordanian, this corpus is mostly Lebanese (72%) plus a
combination of other Levantine speakers. Levantine Arabic QT Training
Data Set 3 Speech is distributed on one DVD-ROM.
(4) Levantine Arabic QT Training Data Set 3 Transcripts contains the
transcription for the Levantine Arabic QT Training Data Set 3. There
are 322 files in UTF-8 format. The corpus also contains a word list and
speaker information files. Levantine Arabic QT Training Data Set 3
Transcripts is distributed on one CD-ROM.
If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215
573 1275.
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
3600 Market Street              Fax: (215) 573-2175
Suite 810                       ldc at ldc.upenn.edu
Philadelphia, PA 19104          http://www.ldc.upenn.edu
------------------------------------------------------------------------
End of Arabic-L: 14 Feb 2005