[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Nov 30 22:33:30 UTC 2006


------------------------------------------------------------------------
*40,000th LDC Corpus Distributed!*

In 2003, the LDC celebrated its tenth anniversary and the distribution 
of our 15,000th corpus.  At that time, the LDC recognized the continued 
support of its constituent members by offering a free membership to the 
university which had licensed the 15,000th corpus. Three short years and 
many requests for data later, we are excited to have recently 
distributed our 40,000th corpus!   We would like to thank all 
organizations which have licensed data for helping the LDC reach this 
landmark distribution.  The growing demand for LDC data from over 2000 
organizations supports our mission to develop and share resources for 
research in linguistic technologies.  At the increased rate at which we 
are distributing corpora, we expect to reach our 50,000th distribution 
soon.  Stay tuned...



*New Publications*

(1)  French Gigaword First Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17> 
is a comprehensive archive of newswire text data that has been acquired 
over several years by the Linguistic Data Consortium (LDC) at the 
University of Pennsylvania.

The two distinct international sources of French newswire in this 
edition, and the time spans of collection covered for each, are as follows:

    * Agence France-Presse (afp_fre) May 1994 - July 2006
    * Associated Press French Service (apw_fre) Nov 1994 - July 2006

The overall totals for each source are summarized below. Note that the 
"Totl-MB" numbers show the amount of data you get when the files are 
uncompressed (i.e. approximately 4.6 gigabytes, total); the "Gzip-MB" 
column shows totals for compressed file sizes as stored on the DVD-ROM; 
the "K-wrds" numbers are the number of whitespace-separated tokens (of 
all types), in thousands, after all SGML tags are eliminated.

Source   #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_FRE     147     1139     3445  482904  1797139
APW_FRE     141      389     1167  167405   622740
TOTAL       288     1528     4612  650309  2419879
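
The token counting described above (whitespace-separated tokens after 
SGML tags are eliminated) can be approximated with a short script. This 
is only a sketch, not the procedure the LDC actually used: the function 
names, the regular expression used to recognize tags, the choice to 
replace each tag with a space, and the default file encoding are all 
assumptions made for illustration.

```python
import gzip
import re

# Assumption: anything between "<" and ">" is treated as an SGML tag.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so that the words on either side
    # of a tag are not accidentally joined into one token.
    return len(TAG_RE.sub(" ", text).split())


def count_tokens_in_gzip(path: str, encoding: str = "utf-8") -> int:
    """Count tokens in one gzip-compressed text file.

    The encoding default is a guess; check the corpus documentation
    for the actual character encoding of the files.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return count_tokens(handle.read())
```

Dividing the sum over all files of one source by 1000 should give a 
figure comparable to the "K-wrds" column, although small differences 
are possible if the tag handling differs from the original count.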


French Gigaword First Edition is distributed on one DVD-ROM. 




(2)  Iraqi Arabic Conversational Telephone Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S45> 
contains 276 Iraqi Arabic speakers taking part in spontaneous telephone 
conversations in Colloquial Iraqi Arabic. A total of 976 conversation 
sides are provided (one speaker appears on two distinct calls). The 
average duration per side is about 6 minutes.

This corpus was collected and transcribed in 2003 and 2004 by Appen Pty 
Ltd, Sydney, Australia.  Iraqi Arabic Conversational Telephone Speech is 
distributed on one DVD-ROM.


(3)  Iraqi Arabic Conversational Telephone Speech, Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T16> 
contains transcripts of 276 Iraqi Arabic speakers taking part in spontaneous telephone 
conversations in Colloquial Iraqi Arabic. A total of 976 conversation 
sides are provided (one speaker appears on two distinct calls). The 
average duration per side is about 6 minutes. This corpus was collected 
and transcribed in 2003 and 2004 by Appen Pty Ltd, Sydney, Australia.   
Iraqi Arabic Conversational Telephone Speech, Transcripts is distributed 
via web download.



------------------------------------------------------------------------


If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call 
+1 215 573 1275.



--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu


