<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p align="center"><b>-  Release of Additional NomBank Files  -</b></p>

<p align="center"><b>-  Switchboard Dialog Act Corpus Now Available  -</b><br>

</p>

<div align="center">LDC2008T17<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T17">CALLHOME

Mandarin Chinese Transcripts - XML version</a>  -</b><br>

</div>

<p align="center">

LDC2008S07<b><br>

-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S07">CSLU:

ISOLET Spoken Letter Database Version 1.3</a>  -<br>

</b></p>

<div align="center">LDC2008T18<br>

<b>

-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T18">GALE

Phase 1 Chinese Broadcast News Parallel Text - Part 3</a>  -<br>

<br>

<br>

</b>

<hr size="2" width="100%"></div>

<b>

<br>

</b>

<p style="margin-bottom: 12pt; text-align: center;" align="center"><b>Release

of

Additional NomBank Files</b></p>

<p>NomBank is an annotation project at New York University which

provides

argument structure for instances of common nouns in <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7">Treebank-2

(LDC95T7)</a> and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42">Treebank-3

(LDC99T42)</a>, also known as the 'Penn Treebanks'.  Last December, the
project released NomBank.1.0, which covers all the "markable" nouns in
the Wall Street Journal material in the Penn Treebanks.  That release
included a total of 114,576 propositions, derived by examining a total
of 202,965 noun instances and choosing only those nouns whose arguments
occur in the text.  NomBank and related resources are available from
the <a
 href="http://nlp.cs.nyu.edu/meyers/NomBank.html">NomBank</a> project
website.</p>

<p>The LDC is now making available additional NomBank data that have
been restricted due to licensing arrangements with their owners. Those
files are as follows:</p>

<ul type="disc">

  <li class="MsoNormal" style=""><b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23">NomBank

v 1.0 (LDC2008T23)</a></b>

  <ul type="circle">

    <li class="MsoNormal" style="">A complete printout of NomBank in
human-readable form.</li>

  </ul>

  </li>

</ul>

<p class="MsoNormal" style="margin-bottom: 12pt;">   

            A license to either <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7">Treebank-2

(LDC95T7) </a>or <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42">Treebank-3

(LDC99T42)</a> is required to obtain

NomBank v 1.0.</p>

<ul type="disc">

  <li class="MsoNormal" style=""><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T24"><b>COMNOM

v 1.0 (LDC2008T24)</b></a>

  <ul type="circle">

    <li class="MsoNormal" style="">COMNOM was created by automatically
adding classes to COMLEX Syntax on the basis of NOMLEX-PLUS.  For
details, please see the document entitled "<a
 href="http://nlp.cs.nyu.edu/meyers/nombank/those-other-nombank-dictionaries.pdf">Those
Other NomBank Dictionaries</a>".</li>

  </ul>

  </li>

</ul>

<blockquote>

  <p class="MsoNormal">A license to <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98L21">COMLEX

English Syntax Lexicon (LDC98L21)</a> or <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T11">COMLEX

Syntax Text Corpus Version 2.0 (LDC96T11)</a> is required to obtain

COMNOM v 1.0.</p>

</blockquote>


<p class="MsoNormal" style="margin-bottom: 12pt;">All requests for
these files can be directed to <a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>.</p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
 align="center"><b>Switchboard
Dialog Act Corpus Now Available</b></p>

<p class="MsoNormal" style="margin-bottom: 12pt;">

The Switchboard Dialog Act Corpus is a version of the <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62">Switchboard-1

Release 2</a>

corpus of telephone conversations tagged with a shallow discourse
tag-set of approximately 60 basic dialog act tags and combinations. The
discourse tag-set used is an augmentation of the Discourse Annotation
and Markup System of Labeling (DAMSL) tag-set, and is referred to as
the 'SWBD-DAMSL' labels. These annotations were created in 1997 at the
University of Colorado at Boulder, with the goal of building better
language models for automatic speech recognition of the Switchboard
domain. To that end, the label-set incorporates both traditional
sociolinguistic and discourse-theoretic rhetorical
relations/adjacency-pairs as well as some more form-based labels. The
Switchboard Dialog Act Corpus contains labels for 1,155 five-minute
conversations, comprising 205,000 utterances and 1.4 million words.

</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">To

download this corpus from our ftp server, please visit the LDC catalog

page for <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62">Switchboard-1

Release 2</a> and scroll down to the section entitled 'Updates'.</p>


<p style="margin-bottom: 12pt; text-align: center;" align="center"><b>New

Publications</b></p>

<p>(1) LDC's CALLHOME Mandarin Chinese collection includes telephone

speech,

associated transcripts and a lexicon. <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S34">CALLHOME

Mandarin Chinese Speech</a> consists of 120 unscripted telephone

conversations

between native speakers of Mandarin Chinese. All calls, which lasted up
to thirty minutes, originated in North America and were
placed to locations overseas; most participants called family members

or close

friends. <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T16">CALLHOME

Mandarin Chinese Transcripts</a> covers a contiguous five- or ten-minute
segment

from each of the telephone speech files. The transcripts are in

tab-delimited

format with GB2312 encoding, are timestamped by speaker turn for

alignment with

the speech signal and are provided in standard orthography. <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T16">CALLHOME

Mandarin Chinese Lexicon</a> comprises over 40,000 words from
twenty
CALLHOME Mandarin transcripts.</p>

<p><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T17">CALLHOME
Mandarin Chinese Transcripts - XML Version</a>, the latest addition to

this

collection, was created by Lancaster University and presents the entire

original corpus of 120 transcripts in XML format with UTF-8 encoding,

retokenization and part-of-speech (POS) tagging. The retokenization and

POS

information were supplied using the Chinese Lexical Analysis System

(ICTCLAS)

developed by the <a href="http://www.ict.ac.cn/english/">Institute of

Computing

Technology, Chinese Academy of Sciences</a>, Beijing.
ICTCLAS aims to incorporate Chinese word segmentation, POS tagging,
disambiguation and unknown word recognition into a single theoretical
framework using multi-layered hierarchical hidden Markov models.</p>

<p>In addition to the original applications for Mandarin Chinese

CALLHOME data

(e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - XML

Version

will be useful in the grammatical study of spoken Mandarin.  This XML

corpus retains all of the linguistic analyses (e.g., timestamps, spoken

features and proper nouns) from the original transcripts release, but

the

mnemonics used in the original release were migrated into XML markup.</p>

<p class="MsoNormal">All analyses in the original release were retained,
at some cost to tokenization and part-of-speech tagging accuracy (e.g.,
some mnemonics encoding spoken features may split a word, which can
affect tagging accuracy). However, the results of the automated
processing were substantially post-edited.  In addition, a large number
of obvious typographical errors in the original release were corrected
in the process of post-editing.</p>


<p class="MsoNormal" style="text-align: center;" align="center"><b>*<br>

<br>

</b></p>

<p>(2) <span style="color: rgb(51, 102, 255);"><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S07">CSLU:

ISOLET Spoken Letter Database Version 1.3</a></span> was created by the

Center

for Spoken Language Understanding (CSLU) at OGI School of Science and

Engineering, Oregon Health and Science University, Beaverton,
Oregon.  CSLU: <span

 style="color: black;">ISOLET Spoken Letter Database Version 1.3</span>

is a database of
letters of the English alphabet, spoken in isolation under quiet
laboratory conditions, with associated transcripts. The data were
collected in 1990 and consist of two productions of each of the 26
letters by 150 speakers (7,800 spoken letters), for approximately 1.25
hours of speech. The subjects were 75 male speakers and 75 female
speakers; all speakers reported English as their native language.</p>

<p>Speech was recorded in the OGI speech recognition laboratory and the

recording equipment was selected to mimic the equipment used to collect

the <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">TIMIT</a>

database as closely as possible. The speech was recorded with a
Sennheiser HMD 224 noise-canceling microphone, low-pass filtered at
7.6 kHz. Data capture was performed using the AT&amp;T DSP32 board
installed in a Sun 4/110. The data were sampled at 16 kHz and converted
to RIFF (.WAV) format.</p>

<p>The transcriptions of the recorded speech are time-aligned phonetic

transcriptions conforming to the CSLU Labeling standards. Time-aligned

word

transcriptions are represented in a standard orthography or

romanization.

Speech and non-speech phenomena are distinguished. The transcriptions

are

aligned to a waveform by placing boundaries to mark the beginning and

ending of

words. In addition to the specification of boundaries, this level of

transcription includes additional commentary on salient speech and

non-speech

characteristics, such as glottalization, inhalation, and exhalation.</p>


<p class="MsoNormal" style="text-align: center;" align="center"><b>*<br>

</b></p>


<p>(3) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T18">GALE

Phase 1 Chinese Broadcast News Parallel Text - Part 3</a> contains

transcripts

and English translations of 19.1 hours of Chinese broadcast news

programming

from Voice of America (VOA), China Central TV (CCTV) and Phoenix TV. It

does

not contain the audio files from which the transcripts and translations

were

generated. GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
is the last part of the three-part GALE Phase 1 Chinese Broadcast News
Parallel Text, which, along with other corpora, was used as training
data in year 1 (Phase 1) of the DARPA-funded GALE program. LDC has
previously released <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T23">GALE

Phase 1 Chinese Broadcast News Parallel Text - Part 1</a> and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08">GALE

Phase 1 Chinese Broadcast News Parallel Text - Part 2</a>.</p>

<p>A total of 19.1 hours of Chinese broadcast news recordings were

selected

from three sources: VOA, CCTV (a

broadcaster from Mainland China) and Phoenix TV (a Hong Kong-based

satellite TV

station).  A manual selection procedure was used to choose data

appropriate for the GALE program, namely, news programs focusing on

current

events. Stories on topics such as sports, entertainment and business

were

excluded from the data set. Manual sentence unit/segment (SU)
annotation was also performed on a subset of files following LDC's
Quick Rich Transcription specification. Three types of end-of-sentence
SU were identified: statement SU, question SU, and incomplete SU.</p>

<p>After transcription and SU annotation, the files were reformatted
into a human-readable translation format and then assigned to
professional translators for careful translation. Translators followed
LDC's GALE Translation guidelines, which describe the makeup of the
translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as
names and speech disfluencies), and quality control procedures applied
to completed translations.

</p>

<br>

<br>

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>

Ilya

Ahtaridis</big></small></small></font><br>

<font face="Courier New, Courier, monospace"><small><small><big>

Membership Coordinator</big></small></small></font><br>

<font face="Courier New, Courier, monospace"><small><small>

</small></small></font><br>

<font face="Courier New, Courier, monospace"><small><small>

</small>--------------------------------------------------------------------</small></font><br>

<font face="Courier New, Courier, monospace"><small>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>