<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p align="center"><i>In this newsletter:<br>

</i></p>

<p align="center"> - <b><a href="#scholar">Fall

2010 LDC Data Scholarship Program</a></b>

-<br>

</p>

<p align="center"><b>- </b><a href="#provide"><b>New
Providing
Guidelines</b></a><b> -</b><br>

</p>

<p align="center"><i>New publications:</i><br>

</p>

<p align="center">LDC2010S05<br>

<b>- <a href="#elephant">Asian

Elephant Vocalizations</a></b><b> -<br>

</b></p>

<p align="center"><span
 style="font-size: 12pt; font-family: 'Liberation Serif';">LDC2010T14</span><br>
<span style="font-size: 12pt; font-family: 'Liberation Serif';"><b>- </b></span><b><a
 href="#openmt">NIST
2005 Open Machine
Translation (OpenMT) Evaluation</a></b><span
 style="font-size: 12pt; font-family: 'Liberation Serif';"><b> -</b><br>
</span></p>

<div align="center">LDC2010V02<br>

<b>- <a href="#trecvid">TRECVID

2006 Keyframes</a></b><b> -<br>

</b></div>

<div align="center"><b><br>

</b>

<hr size="2" width="100%"><b><br>

</b></div>

<p style="text-align: center;" align="center"><a name="scholar"></a><b>Fall

2010 LDC Data

Scholarship

Program</b><o:p></o:p></p>

<p class="MsoNormal">Applications are now being accepted through
September 15, 2010 for the
Fall 2010
LDC Data Scholarship program! The LDC Data Scholarship program
provides university students with access to LDC data at no cost. Data

scholarships

are offered twice a year to correspond to the Fall and Spring

semesters,

beginning with the Fall 2010 semester (September - December 2010).

Several

students can be awarded scholarships during each program cycle.  This

program is open to students pursuing both undergraduate and graduate

studies in

an accredited college or university. LDC Data Scholarships are not

restricted

to any particular field of study; however, students must demonstrate a

well-developed research agenda and a bona fide inability to pay.  <br>

<br>

The application consists of two parts:<o:p></o:p></p>

<p class="MsoNormal">(1) <em><b>Data Use Proposal</b></em>. Applicants

must

submit a proposal describing their intended use of the data. The

proposal must

contain the applicant's name, university, and field of study. The

proposal

should state which data the student plans to use and contain a

description of

their research project.  Students are advised to consult the <a

 href="http://www.ldc.upenn.edu/Catalog/index.jsp">LDC Corpus Catalog</a>

for a

complete list of data distributed by LDC. Due to certain restrictions,
a
handful of LDC corpora are available only to members of the Consortium.</p>

<p>(2) <em><b>Letter of Support</b></em>. Applicants must submit one

letter of

support from their thesis advisor or department chair. The letter must

verify

the student's need for data and confirm that the department or

university lacks

the funding to pay the full Non-member Fee for the data.<o:p></o:p></p>

<p>For further information on application materials and program rules,

please

visit the <a href="http://www.ldc.upenn.edu/About/scholarships.html">LDC

Data

Scholarship</a> page.  <o:p></o:p></p>

<p>Students can email their applications to the <a

 href="mailto:datascholarships@ldc.upenn.edu">LDC Data Scholarship

program</a>.

Decisions will be sent by email from the same address.<o:p></o:p></p>

<p>The deadline for the Fall 2010 program cycle is September 15, 2010.</p>

<p>Track the LDC Data Scholarship program at <a

 href="http://www.wikicfp.com/">WikiCFP</a>!<br>

</p>

<p>

[<a href="#top">

top </a>]<br>

<o:p></o:p></p>

<p style="text-align: center;" align="center"><a name="provide"></a><b>New

Providing

Guidelines</b><o:p></o:p></p>

<p>LDC is pleased to announce that our <a
 href="http://www.ldc.upenn.edu/Providing/">Providing</a> page has
recently been
updated and enhanced with detailed guidelines for submitting
corpora and
other resources for publication by LDC. The new Providing page
describes the
entire process of sharing data through LDC, from the initial publication
inquiry
to delivery of the data for publication. LDC's preferred submission
formats for
video, audio, and text data, its preferred directory structure, and best
practices for
file naming are covered in depth. The page also includes

information on providing adequate metadata and documentation of your

data set.<br>

<br>

Researchers interested in publishing data through LDC are invited to

use the <a href="http://www.ldc.upenn.edu/Providing/subform.html">Publication

Inquiry Form</a>. 

The inquiry form will prompt you for basic information about your data

including title, author, language, details on corpus size and format,

as well

as a description.  Once your inquiry has been received, our External

Relations staff can assist you through each step of the publication

process.<o:p></o:p></p>

<p style="margin-bottom: 12pt;">Why share your data through LDC? 

Resources distributed by LDC reach a global audience. All published

resources

appear in LDC’s online <a href="http://www.ldc.upenn.edu/Catalog">Catalog</a>,

which is accessed daily by users worldwide. LDC’s monthly newsletter

keeps the

community abreast of all new publications, and its reach ensures the

attention

of interested researchers. LDC members receive copies of the corpora as

part of

their membership benefits. LDC’s Membership structure therefore

guarantees your

data greater exposure to major organizations working in human language
technologies and related fields.<br>

<br>

The LDC Corpus Catalog contains a variety of resources in many

languages and

formats ranging from written to spoken and video. Speech and video data

may

derive from broadcast collections, interviews, and recordings of

telephone

conversations. Text data comes from a variety of sources including

newswire,

document archives and anthologies as well as the World Wide Web. LDC

also

publishes dictionaries and lexicons in a variety of languages.<br>

</p>

<p style="margin-bottom: 12pt;">

[<a href="#top">

top </a>] <o:p></o:p></p>

<p style="text-align: center;" align="center"><b><span
 style="font-family: 'Liberation Serif';">New
Publications</span></b></p>

<p><a name="elephant"></a>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S05">Asian

Elephant Vocalizations</a> consists of 57.5 hours of audio recordings

of

vocalizations by Asian elephants (<i>Elephas maximus</i>) in the Uda
Walawe
National Park, Sri Lanka, of which 31.25 hours have been annotated. The
collection and annotation of the recordings were conducted and overseen
by
Shermin de Silva of the University of Pennsylvania Department of
Biology;
the voice-recorded field notes are by Shermin de Silva and Ashoka
Ranjeewa. The

recordings primarily feature adult female and juvenile elephants.

Existing

knowledge of acoustic communication in elephants is based primarily on

African

species (<i>Loxodonta africana</i> and <i>Loxodonta cyclotis</i>).

Communication in Asian elephants has been studied comparatively little.</p>

<p>This corpus is intended to enable researchers in acoustic

communication to

evaluate acoustic features and repertoire diversity of the recorded

population.

Of particular interest is whether there may be regional dialects that

differ

among Asian elephant populations in the wild and in captivity. A second

interest is in whether structural commonalities exist between this and

other

species that shed light on underlying social and ecological factors

shaping

communication systems.<o:p></o:p></p>

<p>Data were collected from May 2006 to December 2007. Observations

were

performed by vehicle during park hours from 0600 to 1830 h. Most

recordings of

vocalizations were made using an Earthworks QTC50 microphone

shock-mounted inside

a Rycote Zeppelin windshield, via a Fostex FR-2 field recorder (24-bit

sample

size, sampling rate 48 kHz). Recordings were initiated at the start of
a call
with a 10-second pre-record buffer so that the entire call was captured
and the loss of
rare vocalizations was minimized. This was made possible by the
'pre-record'
feature of the Fostex, which records continuously but saves a file,
including
a 10-second lead, only once the 'record' button is pressed.</p>
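<p>The pre-record behavior described above can be illustrated with a
small ring buffer. This is only a sketch of the general technique, not
the recorder's firmware: the class name <code>PreRecordBuffer</code>,
its methods, and the use of plain integers as audio samples are all
hypothetical.</p>

```python
from collections import deque


class PreRecordBuffer:
    """Hold only the most recent `lead_seconds` of audio until 'record' is pressed.

    Hypothetical illustration; names and sample representation are assumptions.
    """

    def __init__(self, sample_rate=48_000, lead_seconds=10):
        # A deque with maxlen silently drops the oldest samples once it is full.
        self._lead = deque(maxlen=sample_rate * lead_seconds)
        self._saved = None  # becomes a list once recording has started

    def feed(self, samples):
        """Accept incoming samples, whether or not recording has started."""
        if self._saved is None:
            self._lead.extend(samples)
        else:
            self._saved.extend(samples)

    def press_record(self):
        """Start saving; the saved audio begins with the buffered lead."""
        if self._saved is None:
            self._saved = list(self._lead)
            self._lead.clear()

    def saved(self):
        """Return a copy of everything saved so far (empty before 'record')."""
        return [] if self._saved is None else list(self._saved)
```

<p>With a 48 kHz sampling rate and a 10-second lead, the buffer holds at
most 480,000 samples per channel, so audio that arrived shortly before
the button press is still available when saving begins.</p>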

<p>Certain audio files were manually annotated, to the extent possible,
with
call type, caller ID, and miscellaneous notes. For call type
annotation, there
are three main categories of vocalizations: those that show clear
fundamental
frequencies (periodic), those that do not (aperiodic), and those that
show
periodic and aperiodic regions as at least two distinct segments.
Calls were
identified as belonging to one of 14 categories. Annotations were made

using the <a

 href="http://www.fon.hum.uva.nl/praat/manual/TextGridEditor.html">Praat

TextGrid Editor</a>, which allows spectral analysis and annotation of

audio

files with overlapping events. Annotations were based on written and

audio-recorded field notes, and in some cases video recordings.

Miscellaneous

notes are free-form, and include such information as distance from

source,

caller identity certainty, and accompanying behavior.<o:p></o:p></p>

<p class="MsoNormal" style="margin-bottom: 12pt;"><br>

</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">

[<a href="#top">

top </a>]<br>

<o:p></o:p></p>

<p class="MsoNormal" style="text-align: center;" align="center">*<o:p></o:p></p>

<p><a name="openmt"></a><span style="font-family: 'Liberation Serif';">(2) 

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T14">NIST

2005 Open Machine Translation (OpenMT) Evaluation</a> is a package

containing

source data, reference translations, and scoring software used in the

NIST 2005

OpenMT evaluation. It is designed to help evaluate the effectiveness of

machine

translation systems. The package was compiled, and its scoring software
developed, by researchers at NIST, making use of newswire source data
and
reference translations collected and developed by LDC. </span></p>

<p>The objective of the NIST OpenMT evaluation series is to support

research

in, and help advance the state of the art of, machine translation (MT)

technologies

-- technologies that translate text between human languages. Input may

include

all forms of text. The goal is for the output to be an adequate and

fluent

translation of the original. The 2005 task was to evaluate translation

from Chinese to English and from Arabic to English. Additional

information

about these evaluations may be found at the <a

 href="http://www.itl.nist.gov/iad/mig/tests/mt/">NIST Open Machine

Translation

(OpenMT) Evaluation web site</a>.<br>

<br>

This evaluation kit includes a single Perl script (mteval-v11a.pl) that

may be

used to produce a translation quality score for one (or more) MT

systems. The

script works by comparing the system output translation with a set of

(expert)

reference translations of the same source text. Comparison is based on

finding

sequences of words in the reference translations that match word

sequences in

the system output translation.<br>

<br>

<span style="font-family: 'Liberation Serif';">This corpus consists of

100 Arabic

newswire documents, 100 Chinese newswire documents, and a corresponding

set of

four separate human expert reference translations. Source text for both

languages was collected from Agence France-Presse and Xinhua News

Agency in

December 2004 and January 2005.</span><o:p></o:p></p>
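<p>The idea of matching word sequences between a system translation and
several reference translations can be sketched as a clipped n-gram
precision. This is a simplified illustration only: it is not the
scoring formula used by mteval-v11a.pl, and the function names
<code>ngrams</code> and <code>clipped_ngram_precision</code>, as well as
the whitespace tokenization, are assumptions made for the example.</p>

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in some reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference where it is most frequent, so repeating a matching
    word does not inflate the score. Hypothetical helper, not mteval code.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    if not hyp_counts:
        return 0.0

    # For each n-gram, the highest count observed in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


if __name__ == "__main__":
    refs = ["the cat is on the mat", "there is a cat on the mat"]
    print(clipped_ngram_precision("the cat sat on the mat", refs, 1))
    print(clipped_ngram_precision("the cat sat on the mat", refs, 2))
```

<p>Having four independent reference translations, as in this corpus,
gives a system output more chances to match a legitimate wording than a
single reference would.</p>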

<p>For each language, the test set consists of two files: a source and

a

reference file. Each reference file contains four independent

translations of

the data set. The evaluation year, source language, test set, version

of the

data, and source vs. reference file are reflected in the file name.

</p>

[<a href="#top"> top </a>]

<p class="MsoNormal" style="text-align: center;" align="center">*</p>

<p class="MsoBodyText"><a name="trecvid"></a>(3)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010V02">TRECVID

2006 Keyframes</a> was developed as a collaborative effort between

researchers

at LDC, <a href="http://www.nist.gov/">NIST</a>, <a

 href="http://www.limsi.fr/">LIMSI-CNRS</a>,

and <a href="http://www.dcu.ie/">Dublin City University</a>. TREC Video Retrieval Evaluation (TRECVID) is

sponsored by the National Institute of Standards and Technology (NIST)

to

promote progress in content-based retrieval from digital video via

open,

metrics-based evaluation. The keyframes in this release were extracted

for use

in the NIST TRECVID 2006 Evaluation. <o:p></o:p></p>

<p class="MsoBodyText">TRECVID is a laboratory-style evaluation that

attempts to

model real world situations or significant component tasks involved in

such

situations. In 2006 TRECVID completed a
2-year cycle on English, Arabic, and Chinese news video. There were

three

system tasks and associated tests: <o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style="">shot boundary determination<o:p></o:p></li>

  <li class="MsoNormal" style="">high-level feature extraction<o:p></o:p></li>

  <li class="MsoNormal" style="">search (interactive,

manually-assisted, and/or fully automatic)<o:p></o:p></li>

</ul>

<p class="MsoBodyText">For a detailed description of the TRECVID

Evaluation

Tasks, please refer to the <a

 href="http://www-nlpir.nist.gov/projects/tv2006/">NIST

TRECVID 2006 Evaluation Description</a>.</p>

<p class="MsoBodyText">The video stills that compose this corpus are

drawn from

approximately 158.6 hours of English, Arabic, and Chinese language

video data

collected by LDC from NBC, CNN, MSNBC, New Tang Dynasty TV, Phoenix TV,

Lebanese Broadcasting Corp., and China Central TV. <o:p></o:p></p>

<p class="MsoBodyText">Shots are fundamental units of video, useful for

higher-level processing. To create the master list of shots, the video

was

segmented. The results of this pass are called subshots. Because the

master

shot reference is designed for use in manual assessment, a second pass

over the

segmentation was made to create the master shots of at least 2 seconds

in

length. These master shots are the ones used in submitting results for

the

feature and search tasks in the evaluation. In the second pass,

starting at the

beginning of each file, the subshots were aggregated, if necessary,

until the

current shot was at least 2 seconds in duration, at which point the

aggregation

began anew with the next subshot. <br>

<br>

The keyframes were selected by going to the middle frame of each shot,
then searching left and right of that frame to locate the nearest
I-frame. This
then became the keyframe and was extracted. Keyframes have been
provided at
both the subshot (NRKF) and master shot (RKF) levels.</p>
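<p>The second-pass aggregation and the keyframe selection described
above can be sketched as follows. This is a minimal illustration, not
the tooling used for the release: the function names, the
representation of subshots as (start, end) pairs in seconds, and the
handling of a short leftover at the end of a file and of ties between
equally distant I-frames are all assumptions.</p>

```python
def aggregate_subshots(subshots, min_duration=2.0):
    """Merge consecutive (start, end) subshots into master shots.

    Subshots are accumulated from the beginning of the file until the
    current shot lasts at least `min_duration` seconds; aggregation then
    starts over with the next subshot.
    """
    master = []
    current_start = None
    current_end = None
    for start, end in subshots:
        if current_start is None:
            current_start = start
        current_end = end
        if current_end - current_start >= min_duration:
            master.append((current_start, current_end))
            current_start = None

    # Assumption: a too-short remainder at the end of the file is folded
    # into the last master shot. The source does not say how this is handled.
    if current_start is not None:
        if master:
            master[-1] = (master[-1][0], current_end)
        else:
            master.append((current_start, current_end))
    return master


def nearest_iframe(first_frame, last_frame, iframes):
    """Return the I-frame number closest to the middle frame of a shot.

    Assumption: on a tie, the I-frame listed first in `iframes` wins.
    """
    middle = (first_frame + last_frame) // 2
    return min(iframes, key=lambda frame: abs(frame - middle))


if __name__ == "__main__":
    subshots = [(0.0, 0.5), (0.5, 1.2), (1.2, 2.5), (2.5, 6.0), (6.0, 6.4)]
    print(aggregate_subshots(subshots))
    print(nearest_iframe(100, 200, [90, 138, 165, 210]))
```

<p>Applying the same keyframe selection to both the subshots and the
aggregated master shots would yield the two keyframe levels (NRKF and
RKF) mentioned above.</p>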

<br>

<br>

[<a href="#top">

top </a>]<br>

<br>

<hr size="2" width="100%">

<div align="center">

<pre class="moz-signature" cols="72"><big><font

 face="Courier New, Courier, monospace"><small><small><big>

Ilya Ahtaridis</big></small></small></font>

<font face="Courier New, Courier, monospace"><small><small><big>Membership Coordinator</big></small></small></font></big>

<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font>

<font face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>