<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<div class="moz-text-html" lang="x-western">

<div align="center"><b>The Linguistic Data

Consortium (LDC) would like to report on recent developments and

announce the availability of two new publications.</b><br>

<br>

<a

 href="http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1"><b>LDC

Celebrates its Fifteenth Anniversary!</b></a><br>

<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13">Free

Google Data (Web 1T 5-gram) Available</a></b><br>

<br>

LDC2007T09<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T09">ISI

Chinese-English Automatically Extracted Parallel Text</a></b><br>

<br>

LDC2007V02<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V02">TRECVID

2003 Keyframes &amp; Transcripts</a></b><br>

</div>

<b><br>

<br>

</b>

<hr size="2" width="100%"><b><br>

</b>

<div align="center"><b>LDC Celebrates its Fifteenth Anniversary!<br>

<br>

</b>

<div align="left">April 15, 2007 marked the start of the LDC's 15th

Anniversary year!  We have many milestones to celebrate this year

including the growth of our staff to include over 40 full-time

employees and an online catalog that includes over 350 linguistic

databases.  Since 1992, no fewer than 2,300 organizations from over 80

different nations have licensed LDC data.  This data has been made

available through donations, funded projects at LDC or elsewhere,

community initiatives, and, increasingly,  LDC initiatives.  Over the

past fifteen years, the LDC has grown from an organization that shares

existing language technology resources to one that is also at the

forefront of creating new data resources, software tools, and

standards. <br>

</div>

<br>

<div align="left">As we celebrate throughout the year, look for new

membership offerings

and announcements.  Be sure to join us as we count down to the

much-anticipated distribution of our 50,000th publication.<br>

<br>

</div>

<b>Free Google Data Available</b><br>

</div>

<br>

<br>

The LDC is pleased to announce that Google Inc. is providing

financial support for the distribution of its Web 1T 5-gram

(LDC2006T13) corpus to universities. As

a result, LDC will make the corpus available at no charge to 50

non-member universities requesting a copy.  Shipping and handling

fees are also being covered by Google.  Note that quantities are

limited and the Web 1T 5-gram data is a popular publication.  We

appreciate Google's

generosity and its interest in supporting language research.  To

obtain a free copy, universities will need to sign and submit a copy of

the <b style=""><a

 href="http://www.ldc.upenn.edu/Catalog/nonmem_agree/Web_1T_5gram_V1_User_Agreement.html">User

License Agreement for Web 1T 5-gram Version 1</a></b>.  Please email <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> with your contact information.<br>

<br>

<br>

<div align="center"><b>New Publications</b><br>

</div>

<b><br>

<br>

</b>(1) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T09">ISI

Chinese-English Automatically Extracted Parallel Text</a>

consists of Chinese-English parallel sentences, which were extracted

automatically from two monolingual corpora: Chinese Gigaword Second

Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T12).

The data was extracted from news articles published by Xinhua News

Agency.

<p>The corpus contains 558,567 sentence pairs; the word count on the

English side is approximately 16M words. The sentences in the parallel

corpus preserve the form and encoding of the texts in the original

Gigaword corpora.</p>

<p>For each sentence pair in the corpus the authors provide the names

of the documents from which the two sentences were extracted, as well

as a confidence score (between 0.5 and 1.0), which is indicative of

their degree of parallelism. The parallel sentence identification

approach is designed to judge sentence pairs in isolation from their

contexts, and can therefore find parallel sentences within document

pairs which are not parallel. The fact that two documents share several

parallel sentences does not necessarily mean the documents are parallel.</p>
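<p>As an illustration of how the confidence scores might be used, the
following minimal Python sketch keeps only the sentence pairs at or above a
chosen threshold. The record layout (SentencePair and its field names), the
function name filter_by_confidence, and the default threshold are
hypothetical; they do not describe the corpus's actual file format, which is
documented with the publication itself.</p>

```python
from dataclasses import dataclass


@dataclass
class SentencePair:
    """Hypothetical in-memory record for one extracted sentence pair."""
    chinese_doc: str   # name of the source Chinese Gigaword document
    english_doc: str   # name of the source English Gigaword document
    chinese: str       # Chinese sentence text
    english: str       # English sentence text
    confidence: float  # degree of parallelism, between 0.5 and 1.0


def filter_by_confidence(pairs, threshold=0.9):
    """Return the pairs whose confidence is at or above the threshold.

    The default of 0.9 is an arbitrary illustrative choice, not a
    recommendation from the corpus authors.
    """
    return [pair for pair in pairs if pair.confidence >= threshold]
```

<p>A higher threshold trades corpus size for cleaner parallel data; the
right value depends on the application.</p>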

<p>In order to make this resource useful for research in Machine

Translation (MT), the authors made efforts to detect potential overlaps

between this data and the standard test and development data sets used

by the MT community.  <br>

</p>

<br>

<br>

<div align="center"><b>*</b><br>

</div>

<br>

TREC Video Retrieval Evaluation (TRECVID) is sponsored by the

National Institute of Standards and Technology (NIST) to promote

progress in content-based retrieval from digital video via open,

metrics-based evaluation. The keyframes in <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V02">TRECVID

2003 Keyframes

&amp; Transcripts</a> were extracted for use in the NIST

TRECVID 2003 Evaluation.   The source data were English-language

broadcast programming collected by LDC in 1998 from ABC ("World News

Tonight") and CNN ("CNN Headline News"). <br>

<br>

TRECVID is a laboratory-style evaluation that attempts to model real

world situations or significant component tasks involved in such

situations. In 2003 there were four main tasks with associated tests: <br>

<br>

<ul>

  <li>shot boundary determination </li>

  <li>story segmentation </li>

  <li>high-level feature extraction </li>

  <li>search (interactive and manual) </li>

</ul>

<br>

Shots are fundamental units of video, useful for higher-level

processing. To create the master list of shots, the video was

segmented. The results of this pass are called subshots. Because the

master shot reference is designed for use in manual assessment, a

second pass over the segmentation was made to create master shots

of at least 2 seconds in length. These master shots are the ones used

in submitting results for the feature and search tasks in the

evaluation. In the second pass, starting at the beginning of each file,

the subshots were aggregated, if necessary, until the current shot was

at least 2 seconds in duration, at which point the aggregation began

anew with the next subshot. <br>
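<br>
The second pass described above can be sketched in Python roughly as
follows. This is a minimal illustration, not the code used for the
evaluation: the function name aggregate_subshots, the representation of
subshots as (start, end) times in seconds, and the handling of a short
remainder at the end of a file are all assumptions. <br>

```python
def aggregate_subshots(subshots, min_duration=2.0):
    """Merge consecutive subshots into master shots.

    subshots: list of (start, end) times in seconds, in file order.
    Returns a list of (start, end) master shots, each at least
    min_duration seconds long, except possibly the last one.
    """
    master_shots = []
    current_start = None
    current_end = None
    for start, end in subshots:
        if current_start is None:
            current_start = start  # begin a new master shot
        current_end = end          # absorb this subshot
        if current_end - current_start >= min_duration:
            master_shots.append((current_start, current_end))
            current_start = None   # aggregation begins anew
    # Assumption: subshots left over at the end of the file that never
    # reach min_duration are kept as a final, shorter shot.
    if current_start is not None:
        master_shots.append((current_start, current_end))
    return master_shots
```

Under these assumptions, a subshot that is already 2 seconds or longer
becomes a master shot on its own, while runs of shorter subshots are
merged. <br>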

<br>

The keyframes were selected by going to the middle frame of each shot,

then searching left and right of that frame to locate the

nearest I-frame. That frame became the keyframe and was extracted.

Keyframes have been provided at both the subshot (NRKF) and master shot

(RKF) levels.  <br>
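<br>
The selection rule above can be sketched in Python as follows. This is a
hedged illustration only: the function name select_keyframe, the use of
integer frame indices, and the choice of the earlier frame when two
I-frames are equally close are assumptions, and locating the I-frames in
the video stream is left to whatever decoder is in use. <br>

```python
def select_keyframe(shot_start, shot_end, iframes):
    """Return the I-frame index nearest the middle frame of a shot.

    shot_start, shot_end: first and last frame index of the shot.
    iframes: frame indices of the I-frames in the video file.
    """
    iframes = list(iframes)
    if not iframes:
        raise ValueError("no I-frames supplied")
    middle = (shot_start + shot_end) // 2
    # Look left and right of the middle frame for the closest I-frame.
    # Assumption: on a tie, the earlier frame is preferred.
    return min(iframes, key=lambda frame: (abs(frame - middle), frame))
```

The same function applies at both levels: passing subshot boundaries
gives a subshot keyframe, and passing master shot boundaries gives a
master shot keyframe. <br>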

<br>

<hr size="2" width="100%"><br>

<div align="center"><small><font face="Courier New, Courier, monospace">Ilya

Ahtaridis<br>

Membership Coordinator</font></small><br>

--------------------------------------------------------------------

<font face="Courier New, Courier, monospace"><br>

</font></div>

<div align="center">

<pre class="moz-signature" cols="72"><b><small><font

 face="Courier New, Courier, monospace">

</font></small>Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104                      <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></b></pre>

</div>

</div>

</body>

</html>