[Corpora-List] New LDC Corpora

Thu Jul 7 20:18:23 UTC 2005

LDC2005T20
Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>

LDC2005T10
Chinese English News Magazine Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10>

LDC2005S14
Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14>

The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of three new corpora.

------------------------------------------------------------------------

Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20> 
supports the development of data-driven approaches to natural language 
processing (NLP), human language technologies, automatic content 
extraction (topic extraction and/or grammar extraction), cross-lingual 
information retrieval, information detection, and other forms of 
linguistic research on Modern Standard Arabic in general. The LDC was 
sponsored to develop an Arabic POS and Treebank of 1,000,000 words, and 
this corpus is part three of that project. In this release, both 
syntactic (treebank) annotation and annotation on part of speech (POS), 
gloss, and word segmentation are provided.

The current Arabic Treebank: Part 3 corpus consists of 600 stories from 
the An Nahar News Agency. The new features include complete vocalization 
of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive.

*

Chinese English News Magazine Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10> 
contains Chinese news stories and their English translations drawn from 
Sinorama Magazine, Taiwan, from 1976 to 2004. The corpus totals 6,366 
story pairs, 365,568 sentence pairs, 20M Chinese characters, and 9M 
English words. The data obtained from Sinorama Magazine was aligned at 
the story level; the LDC then aligned it at the sentence level using 
champollion v1.1. The Sinorama Chinese text is encoded in Big5.
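
Because the Chinese side is distributed in Big5, users working in a 
Unicode pipeline will likely want to convert it first. The following is 
a minimal sketch, not part of the LDC release: the function name 
big5_to_utf8 and the sample string are assumptions made for 
illustration, and it uses Python's standard "big5" codec. If the data 
contains characters outside the base Big5 repertoire, a related codec 
such as "cp950" or "big5hkscs" may be needed instead.

```python
def big5_to_utf8(data: bytes) -> bytes:
    """Decode Big5-encoded bytes and re-encode them as UTF-8.

    Bytes that cannot be decoded are replaced with U+FFFD rather than
    raising, so a single bad byte does not abort a whole file.
    """
    text = data.decode("big5", errors="replace")
    return text.encode("utf-8")


# Hypothetical sample: two Chinese characters stored as Big5 bytes.
big5_bytes = "中文".encode("big5")
utf8_bytes = big5_to_utf8(big5_bytes)
print(utf8_bytes.decode("utf-8"))
```

For whole files, the same conversion can be done by opening the input 
with encoding="big5" and writing the output with encoding="utf-8".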

*

Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14> 
contains 901 calls, totaling 133.6 hours of telephone conversation 
speech in Levantine Arabic. The majority of speakers in this corpus are 
Lebanese. The corpus also includes 901 transcript files in UTF-8 format. 
Speaker information files are provided.

------------------------------------------------------------------------

If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call 
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu
