Arabic-L:LING:New Levantine Arabic Speech Corpus from LDC

Mon Feb 14 20:47:03 UTC 2005

------------------------------------------------------------------------
Arabic-L: Mon 14 Feb  2005
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: New Levantine Arabic Speech Corpus from LDC

-------------------------Messages-----------------------------------
1)
Date: 14 Feb 2005
From: reposted from Corpora
Subject: New Levantine Arabic Speech Corpus from LDC

LDC2005S08
BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts

LDC2005T01
Chinese Treebank 5.0

LDC2005S07
Levantine Arabic QT Training Data Set 3 Speech

LDC2005T03
Levantine Arabic QT Training Data Set 3 Transcripts

  The Linguistic Data Consortium (LDC) would like to announce the  
availability of four new corpora.

(1)  BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts  
consists of transcribed, spontaneous speech, recorded from subjects  
speaking in Levantine colloquial Arabic. Levantine Arabic is the  
dialect of Arabic spoken by ordinary people in Lebanon, Jordan, Syria,  
and Palestine. It is significantly different from Modern Standard  
Arabic (MSA), in that it is a spoken rather than a written language. It  
includes different word pronunciations, and even different words.

  The corpus would be useful for anyone attempting to do speech  
recognition in Levantine colloquial Arabic, including for speech  
translation and spoken dialog systems. BBN/AUB DARPA Babylon Levantine  
Arabic Speech and Transcripts is distributed on two DVD-ROMs.

  (2)  Chinese Treebank 5.0 is a 500K word corpus of Chinese text with  
syntactic bracketing. The corpus contains 824K Hanzi, 18K sentences,  
and 890 data files. The data is drawn from three sources: Xinhua  
(1994-1998), Information Services Department of HKSAR (1997), and  
Sinorama magazine, Taiwan (1996-1998 & 2000-2001).

  All files are GB encoded. Chinese Treebank 5.0 provides four versions  
of files: bracketed, raw, segmented and POS tagged. The raw, segmented  
and POS tagged versions are generated from the bracketed version and so  
do not reflect the previous annotation stages. Chinese Treebank 5.0 is  
distributed on one CD-ROM.  

  (3)  Levantine Arabic QT Training Data Set 3 Speech contains 322  
telephone conversations and totals about 50 hours of Levantine Arabic  
speech. Participants were instructed to speak on set topics. Unlike
the previous training data corpora (Sets 1 and 2), whose speakers are
nearly 100% Jordanian, this corpus is mostly Lebanese (72%) plus a
combination of other Levantine speakers. Levantine Arabic QT Training
Data Set 3 Speech is distributed on one DVD-ROM.

  (4)  Levantine Arabic QT Training Data Set 3 Transcripts contains the  
transcription for the Levantine Arabic QT Training Data Set 3.  There  
are 322 files in UTF-8 format. The corpus also contains a word list and
speaker information files.  Levantine Arabic QT Training Data Set 3  
Transcripts is distributed on one CD-ROM.

If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215
573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                          ldc at ldc.upenn.edu
Philadelphia, PA 19104                      http://www.ldc.upenn.edu

------------------------------------------------------------------------
End of Arabic-L:  14 Feb  2005