[Corpora-List] New Releases from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Feb 8 20:37:55 UTC 2005


LDC2005S08
*BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts*

LDC2005T01
*Chinese Treebank 5.0*

LDC2005S07
*Levantine Arabic QT Training Data Set 3 Speech*

LDC2005T03
*Levantine Arabic QT Training Data Set 3 Transcripts*


The Linguistic Data Consortium (LDC) would like to announce the
availability of four new corpora.

------------------------------------------------------------------------


(1)  BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S08>
consists of transcribed, spontaneous speech, recorded from subjects
speaking in Levantine colloquial Arabic. Levantine Arabic is the dialect
of Arabic spoken by ordinary people in Lebanon, Jordan, Syria, and
Palestine. It differs significantly from Modern Standard Arabic (MSA)
in that it is a spoken rather than a written language, with different
word pronunciations and even different words.

The corpus should be useful for anyone working on speech recognition
in Levantine colloquial Arabic, including for speech translation and
spoken dialog systems. BBN/AUB DARPA Babylon Levantine Arabic Speech
and Transcripts is distributed on two DVD-ROMs.



(2)  Chinese Treebank 5.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T01>
is a 500K-word corpus of Chinese text with syntactic bracketing. The
corpus contains 824K Hanzi, 18K sentences, and 890 data files. The data
is drawn from three sources: Xinhua (1994-1998), Information Services
Department of HKSAR (1997), and Sinorama magazine, Taiwan (1996-1998 &
2000-2001).

All files are GB encoded. Chinese Treebank 5.0 provides four versions of
files: bracketed, raw, segmented and POS tagged. The raw, segmented and
POS tagged versions are generated from the bracketed version and so do
not reflect the previous annotation stages. Chinese Treebank 5.0 is
distributed on one CD-ROM.
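
For readers whose tools expect Unicode, the GB-encoded files can be
re-encoded before use. The following is a minimal sketch, not part of
the release: it assumes "GB" refers to GB2312 and decodes with Python's
"gb18030" codec, which is a superset of GB2312. The function names and
the demo file name are hypothetical.

```python
import tempfile
from pathlib import Path


def transcode_gb_to_utf8(src: Path, dst: Path, src_encoding: str = "gb18030") -> None:
    """Re-encode one GB-encoded text file as UTF-8.

    Assumption: the source is GB2312 (or another GB-family encoding);
    gb18030 is used because it decodes GB2312 bytes unchanged.
    """
    text = src.read_text(encoding=src_encoding)
    dst.write_text(text, encoding="utf-8")


def count_hanzi(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs block."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def demo() -> str:
    """Round-trip a small GB2312 sample through a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.gb"      # hypothetical file name
        dst = Path(tmp) / "sample.utf8"
        src.write_bytes("中文树库".encode("gb2312"))
        transcode_gb_to_utf8(src, dst)
        return dst.read_text(encoding="utf-8")
```

Calling transcode_gb_to_utf8 once per data file would give a UTF-8 copy
of the corpus; count_hanzi is only a rough check and ignores characters
outside the basic CJK block.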



(3)  Levantine Arabic QT Training Data Set 3 Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S07>
contains 322 telephone conversations and totals about 50 hours of
Levantine Arabic speech. Participants were instructed to speak on set
topics.  Unlike the previous training data corpora (Sets 1 and 2), whose
speakers are nearly 100% Jordanian, this corpus is mostly Lebanese (72%),
plus a combination of other Levantine speakers.  Levantine Arabic QT
Training Data Set 3 Speech is distributed on one DVD-ROM.



(4)  Levantine Arabic QT Training Data Set 3 Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T03>
contains the transcriptions for Levantine Arabic QT Training Data Set
3.  There are 322 files in UTF-8 format. The corpus also contains a word
list and speaker information files.  Levantine Arabic QT Training Data
Set 3 Transcripts is distributed on one CD-ROM.


------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573
1275.


--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu

