[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Jun 24 19:17:49 UTC 2010
- Mark Liberman, LDC Director, wins the 2010 Antonio Zampolli Prize

New publications:

LDC2010T07 - Chinese Treebank 7.0
LDC2010T11 - NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010V01 - TRECVID 2004 Keyframes & Transcripts
------------------------------------------------------------------------
*Mark Liberman, LDC Director, wins the 2010 Antonio Zampolli Prize*
LDC is proud to announce that our founder and Director, Mark Liberman,
was awarded the 2010 Antonio Zampolli prize
<http://www.elra.info/Antonio-Zampolli-Prize.html> at LREC2010
<http://www.lrec-conf.org/lrec2010/>, hosted by ELRA
<http://www.elra.info/>, the European Language Resources Association.
This prestigious honor is given by ELRA's board members to recognize
"outstanding contributions to the advancement of language resources and
language technology evaluation within human language technologies".
Mark's prize talk, delivered on May 21, 2010 and entitled The Future of
Computational Linguistics: or, What Would Antonio Zampolli Do?
<http://languagelog.ldc.upenn.edu/myl/AntonioZampolliPrizeLecture.pdf>,
discussed Antonio Zampolli's far-reaching contributions to the language
technology community and how his vision resonates in Mark's research.
Please join us in congratulating Mark on receiving this award.
*New Publications*
(1) Chinese Treebank 7.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T07>
consists of 840,000 words of annotated and parsed text from Chinese
newswire, magazine news, and various broadcast news and broadcast
conversation programs. The Chinese Treebank project began at the
University of Pennsylvania in 1998, continued at the University of
Colorado, and is in the process of moving to Brandeis University
<http://www.cs.brandeis.edu/%7Ellc/page2/page2.html>. The project
provides a large, part-of-speech tagged and fully bracketed Chinese
language corpus. The first deliveries provided syntactically annotated
words from newswire texts. The annotation of broadcast news and
broadcast conversation data began and continues under the DARPA GALE
(Global Autonomous Language Exploitation) program; Chinese Treebank 7.0
represents the results of that effort.
Chinese Treebank 7.0 includes text from the following genres and sources.
  Genre                                                  # words
  Newswire (Xinhua)                                      250,000
  News Magazine (Sinorama)                               150,000
  Broadcast News (CBS, CNR, CTS, CCTV, VOM)              270,000
  Broadcast Conversation (CCTV, CNN, MSNBC, Phoenix)     170,000
  Total                                                  840,000
The annotation of syntactic structure trees for the Chinese newswire
data was taken from Chinese Treebank 5.0 and updated with some
corrections. Known problems, like multiple tree nodes at the top level,
were fixed. Inconsistent annotations for object control verbs were also
corrected. The residual Traditional Chinese characters in the Sinorama
portion of the data, the result of incomplete automatic conversion, have
been manually normalized to Simplified Chinese characters.
This release contains the frame files for each annotated verb or noun,
which specify the argument structure (semantic roles) for each
predicate. The frame files are effectively lexical guidelines for the
PropBank annotation. The semantic roles annotated in this data can only
be interpreted with respect to these frame files. The annotation of the
verbs in the Xinhua news portion of the data is taken from Chinese
Proposition Bank 1.0 (LDC2005T23)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23>.
The annotation of the predicate-argument structure of the included
nouns, which are primarily nominalizations, has not been previously
released. The Sinorama portion of the data, both for verbs and nouns,
has not been previously released.
(2) NIST 2003 Open Machine Translation (OpenMT) Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T11>
is a package containing source data, reference translations, and scoring
software used in the NIST 2003 OpenMT evaluation. It is designed to help
evaluate the effectiveness of machine translation systems. The package
was compiled and scoring software was developed by researchers at NIST,
making use of newswire source data and reference translations collected
and developed by LDC.
The objective of the NIST OpenMT evaluation series is to support
research in, and help advance the state of the art of, machine
translation (MT) technologies -- technologies that translate text
between human languages. Input may include all forms of text. The goal
is for the output to be an adequate and fluent translation of the
original. Additional information about these evaluations may be found at
the NIST Open Machine Translation (OpenMT) Evaluation web site
<http://www.itl.nist.gov/iad/mig/tests/mt/>.
This evaluation kit includes a single Perl script that may be used to
produce a translation quality score for one (or more) MT systems. The
script works by comparing the system output translation with a set of
(expert) reference translations of the same source text. Comparison is
based on finding sequences of words in the reference translations that
match word sequences in the system output translation.
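The scoring approach described above is the familiar n-gram matching idea
behind BLEU-style evaluation. As a rough illustration only (this is not the
NIST scoring script, and the function names below are invented), a clipped
n-gram precision against multiple reference translations can be computed
along these lines in Python:

    # Illustrative sketch only -- not the NIST mteval Perl script.
    # Computes clipped (modified) n-gram precision of a system translation
    # against multiple reference translations.
    from collections import Counter

    def ngrams(tokens, n):
        """Return a Counter of all n-grams in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def clipped_precision(system, references, n):
        """Fraction of system n-grams also found in a reference, with each
        n-gram's count clipped to its maximum count in any single reference."""
        sys_counts = ngrams(system, n)
        if not sys_counts:
            return 0.0
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in sys_counts.items())
        return matched / sum(sys_counts.values())

    if __name__ == "__main__":
        system = "the cat sat on the mat".split()
        references = [
            "the cat is on the mat".split(),
            "there is a cat on the mat".split(),
        ]
        for n in (1, 2):
            print(f"{n}-gram precision: {clipped_precision(system, references, n):.2f}")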
The Chinese-language and Arabic-language source text included in this
corpus is a reorganization of data that was initially released to the
public respectively as Multiple-Translation Chinese (MTC) Part 4
(LDC2006T04)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04>
and Multiple-Translation Arabic (MTA) Part 2 (LDC2005T05)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T05>.
The reference translations are a reorganized subset of data from these
same Multiple-Translation corpora. All source data for this corpus is
newswire text collected in January and February of 2003 from Agence
France-Presse and Xinhua News Agency. For details on the methodology of
the source data collection and production of reference translations, see
the documentation for the above-mentioned corpora.
For each language, the test set consists of two files, a source and a
reference file. Each reference file contains four independent
translations of the data set. The evaluation year, source language, test
set, version of the data, and source vs. reference file are reflected in
the file name.
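The exact naming scheme is documented in the corpus itself rather than in
this announcement. Purely as an illustration (the file-name pattern below is
hypothetical), names carrying those attributes can be parsed like this:

    # Hypothetical file-name layout -- the real naming scheme is described
    # in the corpus documentation, not in this announcement.
    import re

    # Assumed pattern: evaluation year, source language, test set, version,
    # and src/ref, e.g. "mt03_arabic_evalset_v1_src.sgm".
    PATTERN = re.compile(
        r"mt(?P<year>\d{2})_(?P<language>[a-z]+)_(?P<testset>\w+)"
        r"_v(?P<version>\d+)_(?P<kind>src|ref)\.sgm"
    )

    def describe(filename):
        """Return the attributes encoded in a (hypothetical) test-set file name."""
        match = PATTERN.match(filename)
        return match.groupdict() if match else None

    print(describe("mt03_arabic_evalset_v1_src.sgm"))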
(3) TRECVID 2004 Keyframes and Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010V01>
was developed as a collaborative effort between researchers at LDC, NIST
<http://www.nist.gov/>, LIMSI-CNRS <http://www.limsi.fr/>, and Dublin
City University <http://www.dcu.ie/>. TREC Video Retrieval Evaluation
(TRECVID) is sponsored by the National Institute of Standards and
Technology (NIST) to promote progress in content-based retrieval from
digital video via open, metrics-based evaluation. The keyframes in this
release were extracted for use in the NIST TRECVID 2004 Evaluation.
TRECVID is a laboratory-style evaluation that attempts to model real
world situations or significant component tasks involved in such
situations. In 2004 there were four main tasks with associated tests:
* shot boundary determination
* story segmentation
* high-level feature extraction
* search (interactive and manual)
For a detailed description of the TRECVID Evaluation Tasks, please refer
to the NIST TRECVID 2004 Evaluation Description.
<http://www-nlpir.nist.gov/projects/tv2004/>
The source data includes approximately 70 hours of English language
broadcast programming collected by LDC in 1998 from ABC ("World News
Tonight") and CNN ("CNN Headline News").
Shots are fundamental units of video, useful for higher-level
processing. To create the master list of shots, the video was segmented.
The results of this pass are called subshots. Because the master shot
reference is designed for use in manual assessment, a second pass over
the segmentation was made to create the master shots of at least 2
seconds in length. These master shots are the ones used in submitting
results for the feature and search tasks in the evaluation. In the
second pass, starting at the beginning of each file, the subshots were
aggregated, if necessary, until the current shot was at least 2 seconds
in duration, at which point the aggregation began anew with the next
subshot.
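As a concrete illustration of that second-pass rule (a sketch under assumed
data representations, not TRECVID's actual tooling; the function name and the
handling of a short trailing remainder are assumptions), merging subshot time
spans into master shots of at least 2 seconds can be written as:

    # Sketch of the aggregation rule described above: walk the subshots in
    # order, merging consecutive ones until the current master shot is at
    # least 2 seconds long, then start a new master shot.
    MIN_DURATION = 2.0  # seconds

    def aggregate_master_shots(subshots, min_duration=MIN_DURATION):
        """subshots: list of (start, end) times in seconds, in temporal order."""
        master_shots = []
        current_start = current_end = None
        for start, end in subshots:
            if current_start is None:
                current_start, current_end = start, end
            else:
                current_end = end
            if current_end - current_start >= min_duration:
                master_shots.append((current_start, current_end))
                current_start = current_end = None
        if current_start is not None:
            # Trailing subshots that never reached the minimum duration
            # (how these were handled in practice is an assumption here).
            master_shots.append((current_start, current_end))
        return master_shots

    # Example: the first three short subshots merge into one 2.4-second
    # master shot; the final 3.0-second subshot stands alone.
    print(aggregate_master_shots([(0.0, 1.2), (1.2, 1.7), (1.7, 2.4), (2.4, 5.4)]))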
The keyframes were selected by going to the middle frame of each shot,
then searching left and right of that frame to locate the nearest
I-frame; that frame was then extracted as the keyframe.
Keyframes have been provided at both the subshot (NRKF) and master shot
(RKF) levels.
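The keyframe rule lends itself to a similarly small sketch. Here the set of
I-frame positions is assumed to be known (for example, from the video
decoder); the function name and the fallback behavior are illustrative
assumptions, not part of the TRECVID tooling.

    # Sketch of the keyframe rule described above: start from the middle
    # frame of the shot and move outward to the nearest I-frame.
    def select_keyframe(shot_start, shot_end, iframe_indices):
        """shot_start/shot_end are frame numbers; iframe_indices is sorted."""
        middle = (shot_start + shot_end) // 2
        candidates = [i for i in iframe_indices if shot_start <= i <= shot_end]
        if not candidates:
            return middle  # fall back to the middle frame itself (assumption)
        # Nearest I-frame to the middle frame wins; ties go to the earlier frame.
        return min(candidates, key=lambda i: (abs(i - middle), i))

    # Example: shot spans frames 100-160, I-frames every 25 frames.
    print(select_keyframe(100, 160, [100, 125, 150, 175]))  # -> 125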
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora