<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000"><a name="top"></a>

<p style="text-align: center;" align="center"><b>- <a href="#prize">Mark
Liberman, LDC Director, wins the 2010 Antonio Zampolli Prize</a> -</b><br>
<br>
<i>New publications:</i></p>

<p style="text-align: center;" align="center">LDC2010T07<br>
<b>- <a href="#ctb">Chinese Treebank 7.0</a> -</b></p>

<p style="text-align: center;" align="center">LDC2010T11<br>
<b>- <a href="#open">NIST 2003 Open Machine Translation (OpenMT)
Evaluation</a> -</b></p>

<p style="text-align: center;" align="center">LDC2010V01<br>
<b>- <a href="#trecvid">TRECVID 2004 Keyframes and Transcripts</a> -</b></p>

<div class="MsoNormal" style="text-align: center;" align="center">

<hr align="center" size="2" width="100%"></div>

<p class="MsoNormal" style="text-align: center;" align="center"><a

 name="prize"></a></p>

<p class="MsoNormal" style="text-align: center;" align="center"><b>Mark
Liberman, LDC Director, wins the 2010 Antonio Zampolli Prize</b></p>

<p class="MsoNormal">LDC is proud to announce that our founder and
Director, Mark Liberman, was awarded the 2010 <a
 href="http://www.elra.info/Antonio-Zampolli-Prize.html">Antonio
Zampolli Prize</a> at <a
 href="http://www.lrec-conf.org/lrec2010/">LREC2010</a>, hosted by <a
 href="http://www.elra.info/">ELRA</a>, the European Language Resources
Association. This prestigious honor is given by ELRA’s board members to
recognize “outstanding contributions to the advancement of language
resources and language technology evaluation within human language
technologies.”</p>

<p class="MsoNormal">Mark’s prize talk, delivered on May 21, 2010, and
entitled <a
 href="http://languagelog.ldc.upenn.edu/myl/AntonioZampolliPrizeLecture.pdf">The
Future of Computational Linguistics: or, What Would Antonio Zampolli Do?</a>,
discussed Antonio Zampolli’s far-reaching contributions to the language
technology community and how his vision resonates in Mark’s research.
Please join us in congratulating Mark on receiving this award.</p>

<p class="MsoNormal">[<a href="#top">top</a>]</p>


<p style="text-align: center;" align="center"><b>New Publications</b></p>

<p><a name="ctb"></a>(1) <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T07">Chinese
Treebank 7.0</a> consists of 840,000 words of annotated and parsed text
from Chinese newswire, magazine news, and various broadcast news and
broadcast conversation programs. The Chinese Treebank project began at
the University of Pennsylvania in 1998, continued at the University of
Colorado, and is in the process of moving to <a
 href="http://www.cs.brandeis.edu/%7Ellc/page2/page2.html">Brandeis
University</a>. The project provides a large, part-of-speech tagged and
fully bracketed Chinese language corpus. The first deliveries provided
syntactically annotated words from newswire texts. The annotation of
broadcast news and broadcast conversation data began, and continues,
under the DARPA GALE (Global Autonomous Language Exploitation) program;
Chinese Treebank 7.0 represents the results of that effort.</p>

<p>Chinese Treebank 7.0 includes text from the following genres and
sources:</p>

<table class="MsoNormalTable" style="width: 80%;" border="1"

 cellpadding="0" width="80%">

  <tbody>

    <tr>
      <td style="padding: 0.75pt; width: 39%;" width="39%">
      <p class="MsoNormal"><strong>Genre (sources)</strong></p>
      </td>
      <td style="padding: 0.75pt; width: 61%;" width="61%">
      <p class="MsoNormal"><strong>Number of words</strong></p>
      </td>
    </tr>
    <tr>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">Newswire (Xinhua)</p>
      </td>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">250,000</p>
      </td>
    </tr>
    <tr>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">News Magazine (Sinorama)</p>
      </td>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">150,000</p>
      </td>
    </tr>
    <tr>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">Broadcast News (CBS, CNR, CTS, CCTV, VOM)</p>
      </td>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">270,000</p>
      </td>
    </tr>
    <tr>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">Broadcast Conversation (CCTV, CNN, MSNBC, Phoenix)</p>
      </td>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">170,000</p>
      </td>
    </tr>
    <tr>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">Total</p>
      </td>
      <td style="padding: 0.75pt;">
      <p class="MsoNormal">840,000</p>
      </td>
    </tr>

  </tbody>

</table>

<p>The syntactic tree annotation for the Chinese newswire data was
taken from Chinese Treebank 5.0 and updated with some corrections.
Known problems, such as multiple tree nodes at the top level, were
fixed. Inconsistent annotations for object control verbs were also
corrected. The residual Traditional Chinese characters in the Sinorama
portion of the data, the result of incomplete automatic conversion,
have been manually normalized to Simplified Chinese characters.</p>

<p>This release contains the frame files for each annotated verb or
noun, which specify the argument structure (semantic roles) for each
predicate. The frame files are effectively lexical guidelines for the
PropBank annotation. The semantic roles annotated in this data can only
be interpreted with respect to these frame files. The annotation of the
verbs in the Xinhua news portion of the data is taken from <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23">Chinese
Proposition Bank 1.0 (LDC2005T23)</a>. The annotation of the
predicate-argument structure of the included nouns, which are primarily
nominalizations, has not been previously released. The Sinorama portion
of the data, both for verbs and nouns, has not been previously
released.</p>

<br>

<p class="MsoNormal">[<a href="#top">top</a>]</p>

<p class="MsoNormal" style="text-align: center;" align="center">*</p>

<p><a name="open"></a>(2) <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T11">NIST
2003 Open Machine Translation (OpenMT) Evaluation</a> is a package
containing source data, reference translations, and scoring software
used in the NIST 2003 OpenMT evaluation. It is designed to help
evaluate the effectiveness of machine translation systems. Researchers
at NIST compiled the package and developed the scoring software, using
newswire source data and reference translations collected and developed
by LDC.</p>

<p class="MsoNormal">The objective of the NIST OpenMT evaluation series
is to support research in, and help advance the state of the art of,
machine translation (MT) technologies, that is, technologies that
translate text between human languages. Input may include all forms of
text. The goal is for the output to be an adequate and fluent
translation of the original. Additional information about these
evaluations may be found at the <a
 href="http://www.itl.nist.gov/iad/mig/tests/mt/">NIST Open Machine
Translation (OpenMT) Evaluation web site</a>.</p>

<p>This evaluation kit includes a single Perl script that may be used
to produce a translation quality score for one or more MT systems. The
script works by comparing the system output translation with a set of
(expert) reference translations of the same source text. Comparison is
based on finding sequences of words in the reference translations that
match word sequences in the system output translation.</p>
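<p>The word-sequence matching described above can be illustrated with a
short Python sketch. This is not the Perl script shipped in the kit,
and it does not reproduce that script's score; it is a minimal example,
under our own assumptions, of counting how many word sequences (n-grams)
in a system output also occur in at least one reference translation.
The function names and the whitespace tokenization are hypothetical
choices made for illustration only.</p>

<pre>
```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-word sequences in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def matched_ngram_fraction(system, references, n):
    """Fraction of system n-grams that also occur in some reference.

    Assumption: text is tokenized on whitespace. Each system n-gram is
    credited at most as many times as it appears in the single reference
    that contains it most often (count clipping).
    """
    sys_counts = ngrams(system.split(), n)
    total = sum(sys_counts.values())
    if total == 0:
        return 0.0
    # Largest count of each n-gram over all reference translations.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in sys_counts.items())
    return matched / total


if __name__ == "__main__":
    # Made-up sentences, not taken from the corpus.
    refs = ["the cat is on the mat", "there is a cat on the mat"]
    hyp = "the cat sat on the mat"
    for n in (1, 2):
        print(n, matched_ngram_fraction(hyp, refs, n))
```
</pre>

<p>In the made-up example, five of the six system words and three of
the five system word pairs are found in the references. A complete
metric would typically combine several n-gram lengths and account for
translation length, which this sketch leaves out.</p>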

<p>The Chinese-language and Arabic-language source text included in
this corpus is a reorganization of data that was initially released to
the public as <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04">Multiple-Translation
Chinese (MTC) Part 4 (LDC2006T04)</a> and <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T05">Multiple-Translation
Arabic (MTA) Part 2 (LDC2005T05)</a>, respectively. The reference
translations are a reorganized subset of data from these same
Multiple-Translation corpora. All source data for this corpus is
newswire text collected in January and February of 2003 from Agence
France-Presse and Xinhua News Agency. For details on the methodology of
the source data collection and production of reference translations,
see the documentation for the above-mentioned corpora.</p>

<p>For each language, the test set consists of two files: a source file
and a reference file. Each reference file contains four independent
translations of the data set. The evaluation year, source language,
test set, version of the data, and source vs. reference file are
reflected in the file name.</p>

<br>

<p class="MsoNormal">[<a href="#top">top</a>]</p>

<p class="MsoNormal" style="text-align: center;" align="center">*</p>

<p><a name="trecvid"></a>(3) <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010V01">TRECVID
2004 Keyframes and Transcripts</a> was developed as a collaborative
effort between researchers at LDC, <a href="http://www.nist.gov/">NIST</a>, <a
 href="http://www.limsi.fr/">LIMSI-CNRS</a>, and <a
 href="http://www.dcu.ie/">Dublin City University</a>. TREC Video
Retrieval Evaluation (TRECVID) is sponsored by the National Institute
of Standards and Technology (NIST) to promote progress in content-based
retrieval from digital video via open, metrics-based evaluation. The
keyframes in this release were extracted for use in the NIST TRECVID
2004 Evaluation. TRECVID is a laboratory-style evaluation that attempts
to model real-world situations or significant component tasks involved
in such situations. In 2004 there were four main tasks with associated
tests:</p>

<ul type="disc">

  <li class="MsoNormal">shot boundary determination</li>
  <li class="MsoNormal">story segmentation</li>
  <li class="MsoNormal">high-level feature extraction</li>
  <li class="MsoNormal">search (interactive and manual)</li>

</ul>

<p>For a detailed description of the TRECVID Evaluation Tasks, please
refer to the <a href="http://www-nlpir.nist.gov/projects/tv2004/">NIST
TRECVID 2004 Evaluation Description</a>.</p>

<p>The source data includes approximately 70 hours of English-language
broadcast programming collected by LDC in 1998 from ABC ("World News
Tonight") and CNN ("CNN Headline News").</p>

<p>Shots are fundamental units of video, useful for higher-level
processing. To create the master list of shots, the video was
segmented. The results of this pass are called subshots. Because the
master shot reference is designed for use in manual assessment, a
second pass over the segmentation was made to create master shots of at
least 2 seconds in length. These master shots are the ones used in
submitting results for the feature and search tasks in the evaluation.
In the second pass, starting at the beginning of each file, the
subshots were aggregated, if necessary, until the current shot was at
least 2 seconds in duration, at which point the aggregation began anew
with the next subshot.</p>

<p>The keyframes were selected by going to the middle frame of the shot
boundary, then parsing left and right of that frame to locate the
nearest I-Frame. This then became the keyframe and was extracted.
Keyframes have been provided at both the subshot (NRKF) and master shot
(RKF) levels.</p>
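<p>The second-pass aggregation and the keyframe choice described above
can be sketched in Python as follows. This is an illustration under
stated assumptions, not the tooling used to build the release: the
function names, the representation of subshots as (start, end) times in
seconds, and the list of I-frame positions are all hypothetical. The
text also does not say what happens to a short remainder at the end of
a file; this sketch assumes it is merged into the last master shot.</p>

<pre>
```python
MIN_MASTER_SHOT_SECONDS = 2.0  # minimum master shot length stated in the text


def aggregate_subshots(subshots, min_seconds=MIN_MASTER_SHOT_SECONDS):
    """Merge consecutive (start, end) subshots into master shots.

    Subshots are added to the current master shot until it lasts at
    least min_seconds; the next subshot then starts a new master shot.
    """
    masters = []
    current_start = None
    for start, end in subshots:
        if current_start is None:
            current_start = start
        if end - current_start >= min_seconds:
            masters.append((current_start, end))
            current_start = None
    if current_start is not None:
        # Assumption (not stated in the source): a trailing remainder
        # shorter than min_seconds is folded into the last master shot.
        last_end = subshots[-1][1]
        if masters:
            masters[-1] = (masters[-1][0], last_end)
        else:
            masters.append((current_start, last_end))
    return masters


def pick_keyframe(first_frame, last_frame, i_frames):
    """Return the I-frame index nearest to the middle frame of a shot.

    i_frames is a hypothetical list of frame indices that are I-frames.
    """
    middle = (first_frame + last_frame) // 2
    return min(i_frames, key=lambda frame: abs(frame - middle))


if __name__ == "__main__":
    # Made-up timings, not taken from the corpus.
    subshots = [(0.0, 0.5), (0.5, 1.2), (1.2, 2.5), (2.5, 6.0), (6.0, 6.8)]
    print(aggregate_subshots(subshots))
    print(pick_keyframe(100, 160, [90, 120, 135, 150]))
```
</pre>

<p>With the made-up timings, the first three subshots are merged into
one master shot, and the short final subshot is folded into the second
one. When two I-frames are equally close to the middle frame, this
sketch simply keeps the one listed first.</p>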

<p class="MsoNormal">[<a href="#top">top</a>]</p>

<hr size="2" width="100%">

<p class="MsoNormal" style="text-align: center;" align="center"><br>

</p>

<br>

<div align="center">

<pre class="moz-signature" cols="72"><big><font

 face="Courier New, Courier, monospace"><small><small><big>Ilya Ahtaridis</big></small></small></font>

<font face="Courier New, Courier, monospace"><small><small><big>Membership Coordinator</big></small></small></font></big>

<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font>

<font face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>