<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<div class="moz-text-html" lang="x-western">

<hr size="2" width="100%">

<div align="center"><b><br>

40,000th LDC Corpus Distributed!</b><br>

</div>

<br>

In 2003, the LDC celebrated its tenth anniversary and the distribution

of our 15,000th corpus.  At that time, the LDC recognized the continued

support of its constituent members by offering a free membership to the

university which had licensed the 15,000th corpus. Three short years

and many requests for data later, we are excited to have recently

distributed our 40,000th corpus!   We would like to thank all

organizations which have licensed data for helping the LDC

reach this landmark distribution.  The growing demand for LDC data from

over 2000 organizations supports our mission to develop and share
resources for research in linguistic technologies.  At the rate at which
we are now distributing corpora, we expect to reach our 50,000th
distribution soon.  Stay tuned...<br>

<br>

<br>

<br>

<div align="center"><b>New Publications<br>

<br>

</b></div>

<p>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17">French

Gigaword First Edition</a> is a comprehensive archive of newswire

text data that has been acquired over several years by the Linguistic

Data Consortium (LDC) at the University of Pennsylvania.</p>

<p>The two distinct international sources of French newswire in this

edition, and the time spans of collection covered for each, are as

follows: </p>

<ul>

  <li>Agence France-Presse (afp_fre) May 1994 - July 2006 </li>

  <li>Associated Press French Service (apw_fre) Nov 1994 - July 2006 </li>

</ul>

<p>The overall totals for each source are summarized below. Note that

the "Totl-MB" numbers show the amount of data you get when the files

are uncompressed (i.e. approximately 4.6 gigabytes, total); the

"Gzip-MB" column shows totals for compressed file sizes as stored on

the DVD-ROM; the "K-wrds" numbers are the counts, in thousands, of
whitespace-separated tokens (of all types) after all SGML tags are
eliminated.</p>

<table>

  <tbody>

    <tr>

      <td>Source</td>

      <td>#Files</td>

      <td>Gzip-MB</td>

      <td>Totl-MB</td>

      <td>K-wrds</td>

      <td>#DOCs</td>

    </tr>

    <tr>

      <td>AFP_FRE</td>

      <td>147</td>

      <td>1139</td>

      <td>3445</td>

      <td>482904</td>

      <td>1797139</td>

    </tr>

    <tr>

      <td>APW_FRE</td>

      <td>141</td>

      <td>389</td>

      <td>1167</td>

      <td>167405</td>

      <td>622740</td>

    </tr>

    <tr>

      <td>TOTAL</td>

      <td>288</td>

      <td>1528</td>

      <td>4612</td>

      <td>650309</td>

      <td>2419879</td>

    </tr>

  </tbody>

</table>

<br>

French Gigaword First Edition is distributed on one DVD-ROM.  <br>

<br>

<br>

<br>

<div align="center">*<br>

</div>

<br>

<p>(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S45">Iraqi

Arabic Conversational Telephone Speech</a> contains recordings of 276 Iraqi

Arabic speakers taking part in spontaneous telephone conversations in

Colloquial Iraqi Arabic. A total of 976 conversation sides are provided

(one speaker appears on two distinct calls). The average duration per

side is about 6 minutes.</p>

<p>This corpus was collected and transcribed in 2003 and 2004 by Appen

Pty Ltd, Sydney, Australia. 

Iraqi Arabic Conversational Telephone Speech

is distributed on one DVD-ROM.</p>

<div align="center">*<br>

</div>

<p><br>

</p>

<p>(3)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T16">Iraqi

Arabic Conversational Telephone Speech, Transcripts</a> contains transcripts of

276 Iraqi Arabic speakers taking part in spontaneous telephone

conversations in Colloquial Iraqi Arabic. A total of 976 conversation

sides are provided (one speaker appears on two distinct calls). The

average duration per side is about 6 minutes. This corpus was collected

and transcribed in 2003 and 2004 by Appen Pty Ltd, Sydney, Australia.  

Iraqi Arabic Conversational

Telephone Speech, Transcripts

is distributed via web download.</p>

<p><br>

<br>

</p>

<hr size="2" width="100%">

<p><br>

</p>

<p align="center"><font face="Courier New"><small><big><font

 face="Times New Roman">If

you need further

information, or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.</font></big></small></font><br>

</p>

<p><font face="Courier New"><small><br>

<br>

</small></font>

</p>

<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

</div>

<br>

<br>

</body>

</html>