[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Apr 30 16:23:40 UTC 2007


*The Linguistic Data Consortium (LDC) would like to report on recent 
developments and announce the availability of two new publications.*
*LDC Celebrates its Fifteenth Anniversary! 
<http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1>*

*Free Google Data (Web 1T 5-gram) Available 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13>*

LDC2007T09
*ISI Chinese-English Automatically Extracted Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T09>*

LDC2007V02
*TRECVID 2003 Keyframes & Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V02>*

------------------------------------------------------------------------

*LDC Celebrates its Fifteenth Anniversary!*

April 15, 2007 marked the start of the LDC's 15th Anniversary year!  We 
have many milestones to celebrate this year, including the growth of our 
staff to over 40 full-time employees and an online catalog that 
includes over 350 linguistic databases.  Since 1992, no fewer than 2,300 
organizations from over 80 different nations have licensed LDC data.  
This data has been made available through donations, funded projects at 
LDC or elsewhere, community initiatives, and, increasingly, LDC 
initiatives.  Over the past fifteen years, the LDC has grown from an 
organization that shares existing language technology resources to one 
that is also at the forefront of creating new data resources, 
software tools, and standards.

As we celebrate throughout the year, look for new membership offerings 
and announcements.  And be sure to join us as we count down to the 
much-anticipated distribution of our 50,000th publication.

*Free Google Data Available*


The LDC is pleased to announce that Google Inc. is providing financial 
support for the distribution of its Web 1T 5-gram (LDC2006T13) corpus to 
universities. As a result, LDC will make the corpus available at no 
charge to 50 
non-member universities requesting a copy.  Shipping and handling fees 
are also being covered by Google.  Note that quantities are limited and 
the Web 1T 5-gram data is a popular publication.  We appreciate Google's 
generosity and its interest in supporting language research.  To obtain 
a free copy, universities will need to sign and submit a copy of the 
*User License Agreement for Web 1T 5-gram Version 1 
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/Web_1T_5gram_V1_User_Agreement.html>*.  
Please email ldc at ldc.upenn.edu with your contact information.


*New Publications*

(1) ISI Chinese-English Automatically Extracted Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T09> 
consists of Chinese-English parallel sentences, which were extracted 
automatically from two monolingual corpora: Chinese Gigaword Second 
Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T12). 
The data was extracted from news articles published by Xinhua News Agency.

The corpus contains 558,567 sentence pairs; the word count on the 
English side is approximately 16M words. The sentences in the parallel 
corpus preserve the form and encoding of the texts in the original 
Gigaword corpora.

For each sentence pair in the corpus the authors provide the names of 
the documents from which the two sentences were extracted, as well as a 
confidence score (between 0.5 and 1.0), which is indicative of their 
degree of parallelism. The parallel sentence identification approach is 
designed to judge sentence pairs in isolation from their contexts, and 
can therefore find parallel sentences within document pairs which are 
not parallel. The fact that two documents share several parallel 
sentences does not necessarily mean the documents are parallel.
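
As an illustration of how the confidence score might be used, here is a 
minimal Python sketch that keeps only the sentence pairs at or above a 
chosen threshold. The record layout, the field names, and the 0.9 
threshold are assumptions made for this example; the announcement does 
not describe the corpus's file format.

```python
from typing import List, NamedTuple


class SentencePair(NamedTuple):
    # Hypothetical in-memory record; field names are illustrative only.
    zh_doc: str        # name of the source Chinese document
    en_doc: str        # name of the source English document
    zh_sentence: str
    en_sentence: str
    confidence: float  # described above as lying between 0.5 and 1.0


def filter_by_confidence(pairs: List[SentencePair],
                         threshold: float = 0.9) -> List[SentencePair]:
    """Keep only the pairs whose confidence is at least `threshold`."""
    return [p for p in pairs if p.confidence >= threshold]
```

A higher threshold trades corpus size for cleaner pairs; a suitable 
value would have to be chosen empirically for a given task.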

In order to make this resource useful for research in Machine 
Translation (MT), the authors made efforts to detect potential overlaps 
between this data and the standard test and development data sets used 
by the MT community. 




TREC Video Retrieval Evaluation (TRECVID) is sponsored by the National 
Institute of Standards and Technology (NIST) to promote progress in 
content-based retrieval from digital video via open, metrics-based 
evaluation. The keyframes in TRECVID 2003 Keyframes & Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V02> 
were extracted for use in the NIST TRECVID 2003 Evaluation. The 
source data used were English-language broadcast programming collected 
by LDC in 1998 from ABC ("World News Tonight") and CNN ("CNN Headline 
News").

TRECVID is a laboratory-style evaluation that attempts to model 
real-world situations or significant component tasks involved in such 
situations. In 2003 there were four main tasks with associated tests:

    * shot boundary determination

    * story segmentation

    * high-level feature extraction

    * search (interactive and manual)


Shots are fundamental units of video, useful for higher-level 
processing. To create the master list of shots, the video was segmented. 
The results of this pass are called subshots. Because the master shot 
reference is designed for use in manual assessment, a second pass over 
the segmentation was made to create the master shots of at least 2 
seconds in length. These master shots are the ones used in submitting 
results for the feature and search tasks in the evaluation. In the 
second pass, starting at the beginning of each file, the subshots were 
aggregated, if necessary, until the current shot was at least 2 seconds 
in duration, at which point the aggregation began anew with the next 
subshot.
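
The second pass described above can be sketched in Python roughly as 
follows. This is an illustration, not the code used for the evaluation: 
the function name and the list-of-durations input are assumptions, and 
the handling of a short remainder at the end of a file (merged here into 
the previous master shot) is a guess, since the description does not 
cover that case.

```python
from typing import List


def aggregate_subshots(subshot_durations: List[float],
                       min_seconds: float = 2.0) -> List[List[int]]:
    """Group consecutive subshots into master shots of >= min_seconds.

    Returns one list of subshot indices per master shot.
    """
    master_shots: List[List[int]] = []
    current: List[int] = []
    current_duration = 0.0

    for index, duration in enumerate(subshot_durations):
        current.append(index)
        current_duration += duration
        if current_duration >= min_seconds:
            # Long enough: close this master shot and start anew
            # with the next subshot.
            master_shots.append(current)
            current = []
            current_duration = 0.0

    # Assumption: leftover subshots shorter than min_seconds are merged
    # into the previous master shot (or kept alone if there is none).
    if current:
        if master_shots:
            master_shots[-1].extend(current)
        else:
            master_shots.append(current)

    return master_shots
```

For example, subshot durations of 0.5, 1.0, 0.7, 3.0, and 1.0 seconds 
would yield two master shots under these assumptions: subshots 0 to 2, 
and subshots 3 to 4.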

The keyframes were selected by going to the middle frame of the shot 
boundary, then parsing left and right of that frame to locate the 
nearest I-Frame. This then became the keyframe and was extracted. 
Keyframes have been provided at both the subshot (NRKF) and master shot 
(RKF) levels. 
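
A rough Python sketch of this keyframe selection follows. It assumes the 
frame types are available as a sequence of "I", "P", and "B" labels, 
that the search stays within the shot, and that ties are resolved toward 
the left; none of these details are stated in the description, and the 
function name is hypothetical.

```python
from typing import Optional, Sequence


def nearest_i_frame(frame_types: Sequence[str],
                    first: int, last: int) -> Optional[int]:
    """Return the index of the I-frame nearest the middle of a shot.

    The shot spans frames first..last inclusive. Returns None if the
    shot contains no I-frame.
    """
    middle = (first + last) // 2
    # Widen the search one frame at a time on both sides of the middle.
    for offset in range(last - first + 1):
        for candidate in (middle - offset, middle + offset):
            if first <= candidate <= last and frame_types[candidate] == "I":
                return candidate
    return None
```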

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104                      http://www.ldc.upenn.edu
