[Corpora-List] News from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Nov 30 22:33:30 UTC 2006
------------------------------------------------------------------------
*40,000th LDC Corpus Distributed!*
In 2003, the LDC celebrated its tenth anniversary and the distribution
of our 15,000th corpus. At that time, the LDC recognized the continued
support of its constituent members by offering a free membership to the
university which had licensed the 15,000th corpus. Three short years and
many requests for data later, we are excited to have recently
distributed our 40,000th corpus! We would like to thank all
organizations which have licensed data for helping the LDC reach this
landmark distribution. The growing demand for LDC data from over 2000
organizations supports our mission to develop and share resources for
research in linguistic technologies. At the increasing rate at which we
are distributing corpora, we expect to reach our 50,000th distribution
soon. Stay tuned...
*New Publications*
(1) French Gigaword First Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17>
is a comprehensive archive of newswire text data that has been acquired
over several years by the Linguistic Data Consortium (LDC) at the
University of Pennsylvania.
The two distinct international sources of French newswire in this
edition, and the time spans of collection covered for each, are as follows:
* Agence France-Presse (afp_fre) May 1994 - July 2006
* Associated Press French Service (apw_fre) Nov 1994 - July 2006
The overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. approximately 4.6 gigabytes, total); the "Gzip-MB"
column shows totals for compressed file sizes as stored on the DVD-ROM;
the "K-wrds" numbers are simply the number of whitespace-separated
tokens (of all types) after all SGML tags are eliminated.
Source    #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFP_FRE      147     1139     3445   482904  1797139
APW_FRE      141      389     1167   167405   622740
TOTAL        288     1528     4612   650309  2419879
French Gigaword First Edition is distributed on one DVD-ROM.
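As a rough illustration of how a "K-wrds"-style figure could be reproduced, the sketch below strips SGML tags and counts the whitespace-separated tokens that remain. It is only a minimal approximation: the function name `count_tokens`, the tag pattern, and the sample document are assumptions made for this example, not the LDC's actual tooling or markup.

```python
import re

# Assumption: anything between "<" and ">" is treated as an SGML tag.
# The real corpus markup may need more careful handling.
TAG_RE = re.compile(r"<[^>]*>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so adjacent words do not fuse together.
    stripped = TAG_RE.sub(" ", sgml_text)
    return len(stripped.split())


# Hypothetical sample document, invented for illustration only.
sample = (
    '<DOC id="EXAMPLE_0001" type="story">\n'
    "<TEXT>\n<P>\nLe ciel est bleu .\n</P>\n</TEXT>\n</DOC>"
)

print(count_tokens(sample))  # prints 5
```

Dividing the summed count by 1000 would give a figure in the same unit as the "K-wrds" column.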
(2) Iraqi Arabic Conversational Telephone Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S45>
contains 276 Iraqi Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Iraqi Arabic. A total of 976 conversation
sides are provided (one speaker appears on two distinct calls). The
average duration per side is about 6 minutes.
This corpus was collected and transcribed in 2003 and 2004 by Appen Pty
Ltd, Sydney, Australia. Iraqi Arabic Conversational Telephone Speech is
distributed on one DVD-ROM.
(3) Iraqi Arabic Conversational Telephone Speech, Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T16>
contains 276 Iraqi Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Iraqi Arabic. A total of 976 conversation
sides are provided (one speaker appears on two distinct calls). The
average duration per side is about 6 minutes. This corpus was collected
and transcribed in 2003 and 2004 by Appen Pty Ltd, Sydney, Australia.
Iraqi Arabic Conversational Telephone Speech, Transcripts is distributed
via web download.
------------------------------------------------------------------------
If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call
+1 215 573 1275.
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu