<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p class="MsoNormal" style="text-align: center;" align="center"><i>New

Publications:<br>

</i></p>

<p class="MsoNormal" style="text-align: center;" align="center">LDC2010S01<b><br>

</b><a href="#speech">-

<b>Fisher

Spanish Speech</b> -</a><br>

</p>

<p class="MsoNormal" style="text-align: center;" align="center">LDC2010T04<b><br>

<a href="#transcripts">-

Fisher

Spanish - Transcripts -</a></b></p>

<p class="MsoNormal" style="text-align: center;" align="center"><i>Other

news:</i><br>

</p>

<p class="MsoNormal" style="text-align: center;" align="center"><a

 href="#65"><b>- 65,000th LDC Corpus Distributed! -</b></a></p>

<p class="MsoNormal" style="text-align: center;" align="center"><a

 href="#pipeline"><b>-

2010

Publications Pipeline -</b></a></p>


<hr size="2" width="100%">

<p class="MsoNormal" style="text-align: center;" align="center"><br>

</p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>New

Publications</b></p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><br>

</p>

<p><a name="speech">(1)  </a><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01">Fisher

Spanish Speech</a> was developed by LDC and consists of audio files

covering

roughly 163 hours of telephone speech from 136 native Caribbean Spanish

and

non-Caribbean Spanish speakers. Full orthographic transcripts of these

audio

files are available in <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04">Fisher

Spanish - Transcripts (LDC2010T04)</a>.</p>

<p>The Fisher telephone conversation collection protocol was created at
LDC to address a critical need of developers trying to build robust
automatic speech recognition (ASR) systems. Under the Fisher protocol, a
very large number of participants each make a few short calls to other
participants, whom they typically do not know, about assigned topics.
This maximizes inter-speaker variation and vocabulary breadth, although
it also increases formality. Previous protocols such as CALLHOME,
CALLFRIEND and Switchboard relied upon participant activity to drive the
collection. Fisher is unique in being platform driven rather than
participant driven. Participants who wish to initiate a call may do so;
however, the collection platform initiates the majority of calls.
Participants need only answer their phones at the times they specified
when registering for the study.</p>

<p>To encourage a broad range of vocabulary, Fisher participants are
asked to speak on an assigned topic. The topic is selected at random from
a list that changes every 24 hours and is assigned to all subjects paired
on that day. Some topics are inherited or refined from previous
Switchboard studies, while others were developed specifically for the
Fisher protocol.</p>

<p>In collecting data for this corpus, attempts were made to provide a
representative distribution of subjects across a variety of demographic
categories, including gender, age, dialect region, and education level.
Native speakers of Caribbean Spanish and non-Caribbean Spanish were
recruited from within the continental United States and Puerto Rico.</p>

<p>The speech recordings consist of 819 telephone conversations of 10

to 12

minutes in duration. They are provided as digital audio files in NIST

SPHERE

format (1024-byte ASCII file headers). The conversations were recorded

as

2-channel mu-law sample data with 8000 samples per second (as captured

from the

public telephone network).</p>
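<p>As a rough illustration of the audio format described above, the sketch below parses a SPHERE-style ASCII header and derives the data rate implied by 2-channel, 8000 samples per second mu-law audio (one byte per sample). The function name and the sample header are assumptions made for this example: the field values simply mirror the figures quoted in the text and are not copied from a corpus file.</p>

```python
def parse_sphere_header(raw):
    """Parse the ASCII header at the start of a NIST SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    # The second line gives the header size in bytes (1024 for this release).
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # field name, type flag, value
        if len(parts) != 3:
            continue
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN" string fields
            fields[name] = value
    return fields


# Illustrative header only: values follow the description in the text.
sample = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024)

fields = parse_sphere_header(sample)
# mu-law stores one byte per sample, so bytes/second = rate * channels.
bytes_per_second = fields["sample_rate"] * fields["channel_count"]
# Approximate size of a 12-minute conversation, header included.
twelve_minute_bytes = fields["header_size"] + 12 * 60 * bytes_per_second
```

<p>For a real file, the same function could be applied to the first 1024 bytes read in binary mode; the audio samples start immediately after the header.</p>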

<br>

<p class="MsoNormal" style="margin-bottom: 12pt;">[<a href="#top">top</a>]</p>

<p class="MsoNormal" style="text-align: center;" align="center">*</p>

<p><a name="transcripts">(2)</a> <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04">Fisher

Spanish - Transcripts</a> was developed by LDC and contains full

orthographic

transcripts of the telephone speech in <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01">Fisher

Spanish Speech (LDC2010S01)</a>. Transcripts cover roughly 163 hours of

telephone speech from 136 native Caribbean Spanish and non-Caribbean

Spanish

speakers.</p>

<p>The transcript files are in plain-text, tab-delimited format (tdf)
with UTF-8 character encoding. They were created with the LDC-developed
transcription tool <a href="http://www.ldc.upenn.edu/tools/XTrans/">XTrans</a>,
which allows for improved handling of multi-channel audio and
overlapping speakers. XTrans is available from LDC.</p>
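<p>A minimal sketch of reading such a tab-delimited, UTF-8 transcript follows. The column names, the sample rows, and the ";;" comment prefix are assumptions made for illustration; the actual column layout is defined in the documentation shipped with the release.</p>

```python
import csv
import io

# Hypothetical column layout, for illustration only.
COLUMNS = ("file", "channel", "start", "end", "speaker", "transcript")


def read_tdf(stream, columns=COLUMNS, comment_prefix=";;"):
    """Yield one dict per tab-delimited row, skipping blank and comment lines."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith(comment_prefix):
            continue
        yield dict(zip(columns, row))


# Invented sample data; not taken from the corpus.
sample = (
    ";; hypothetical comment line\n"
    "call_001.sph\t0\t0.00\t1.85\tA\thola, ¿cómo estás?\n"
    "call_001.sph\t1\t1.90\t3.10\tB\tbien, gracias\n"
)

rows = list(read_tdf(io.StringIO(sample)))
# Total transcribed time per channel, in seconds.
seconds_by_channel = {}
for row in rows:
    duration = float(row["end"]) - float(row["start"])
    seconds_by_channel[row["channel"]] = (
        seconds_by_channel.get(row["channel"], 0.0) + duration
    )
```

<p>When reading from disk, opening the file with <code>encoding="utf-8"</code> and <code>newline=""</code> keeps accented characters and line endings intact for the <code>csv</code> module.</p>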

<p>Transcribers followed LDC's Transcription Guidelines (NQTR), which
are included with the documentation for this release.</p>

<p><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01">Fisher

Spanish Speech (LDC2010S01)</a> provides the digital audio used as the

basis

for the transcriptions in this corpus, in the form of 2-channel mu-law

sample

data with 8000 samples per second (as captured from the public

telephone

network), for 819 telephone conversations of 10 to 12 minutes in

duration. The

audio files are in NIST SPHERE format
(1024-byte ASCII file headers).</p>

[<a href="#top">

top </a>]

<p class="MsoNormal" style="text-align: center;" align="center"><a

 name="65"></a><b>65,000th LDC

Corpus Distributed!</b>

</p>

<p class="MsoNormal" style="margin-bottom: 12pt;"><b><br>

</b>LDC has recently reached another milestone.  Two years after having

distributed our 50,000th corpus, we have just distributed our

65,000th! 

To help us celebrate, we took the names of all the organizations that

had

licensed data on the day we distributed our 65,000th corpus and tossed

them

into a Phillies baseball cap.</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">We then randomly drew
a name, and the winner is ... Swarthmore College and Universidad Carlos
III de Madrid! That's not a typo: we have two lucky winners! We are
celebrating our 65,000th distribution by awarding a benefit of US$2000
each to Swarthmore College and Universidad Carlos III de Madrid. The
benefit can be used towards membership or data licensing fees at any
time this year.<br>
<br>
Swarthmore College and Universidad Carlos III de Madrid join the other
recipients of our landmark corpus distributions:</p>

<ul type="disc">

  <li class="MsoNormal">Helsinki University of Technology,
Adaptive Informatics Research Centre (AIRC) - licensed our 50,000th
distribution in January 2008.</li>
  <li class="MsoNormal">Instituto de Engenharia de
Sistemas e Computadores (INESC) - licensed our 40,000th distribution in
November 2006.</li>
  <li class="MsoNormal">University of Hawai'i, Manoa, Language
Analysis and Experimentation Laboratories - licensed our 15,000th
distribution in April 2002.</li>

</ul>

<p class="MsoNormal" style="margin-bottom: 12pt;">We would like to
thank both members and non-members for helping LDC reach this landmark
distribution. The unceasing demand for LDC data from over 2800
organizations supports our mission to develop and share resources for
research in human language technologies.</p>

<p class="MsoNormal" style="margin-bottom: 12pt;"><br>

About our winners:</p>

<blockquote>

  <p class="MsoNormal">Swarthmore College
~ The Department of Computer Science offers courses that emphasize the
fundamental concepts of computer science, treating today's languages
and systems as current examples of the underlying concepts. By educating
students to think conceptually, we are preparing them to adapt to
developments in this dynamic field.<br>
  <br>
Universidad Carlos III de Madrid ~ The Multimedia Processing Group aims
to make a significant research contribution to the field of multimedia
processing, especially focusing on combining signal analysis tools with
emerging machine learning methods. Projects include automatic multimedia
indexing, automatic speech recognition, and latest-generation video
coding.</p>

</blockquote>

<p class="MsoNormal" style="margin-bottom: 12pt;"><br>
[<a href="#top">top</a>]</p>


<p style="text-align: center;" align="center"><b><a name="pipeline"></a></b><b>2010

Publications Pipeline</b></p>

<p>For Membership Year 2010 (MY2010), we anticipate releasing a varied
selection of publications. Many publications are still in development,
but here is a glimpse of what is in the pipeline for MY2010. Please note
that this list is tentative and subject to change. Our planned
publications for the coming months include:</p>

<blockquote>

  <p class="MsoNormal" style=""><i>Arabic
Treebank: Part 3 v 3.2</i> ~ a revision of Arabic Treebank: Part 3
(full corpus) v 2.0 (MPG + Syntactic Analysis) (LDC2005T20). The full
Arabic Treebank: Part 3 has been revised according to the new Arabic
Treebank annotation guidelines. The Arabic Treebank project consists of
two distinct phases: (a) Part-of-Speech (POS) tagging, which divides the
text into lexical tokens and gives relevant information about each
token, such as lexical category, inflectional features, and a gloss, and
(b) Arabic Treebanking, which characterizes the constituent structures
of word sequences, provides categories for each non-terminal node, and
identifies null elements, co-reference, traces, etc. Arabic Treebank:
Part 3 v 3.2 consists of 599 newswire stories from An Nahar.<br>
  <br>
  <i>Chinese Treebank 7.0</i> ~ this release encompasses 2400 text
files containing 45,000 sentences, 1.1 million words and 1.65 million
hanzi (Chinese characters). The data is provided in two encodings, GBK
and UTF-8, and the annotation uses Penn Treebank-style labeled
brackets.</p>

  <p class="MsoNormal" style=""><i>Chinese

Web 5-gram Version 1</i> ~ contains n-grams (unigrams to five-grams)

and their

observed counts in 880 billion tokens of Chinese web data collected in

March

2008. All text was converted to UTF-8. A simple segmenter using the

same

algorithm used to generate the data is included. The set contains 3.9

billion

n-grams total.<br>

  <br>

  <i>NPS Chat Corpus Version 1.0</i> ~ consists of 10,567 posts
gathered from age-specific chat rooms. Each file is a transcript
recorded from one of these chat rooms over a short period on a
particular day. To comply with the chat services' terms of service, the
posts have been privacy-masked. Each post is annotated with a chat
dialog-act tag, and individual tokens within each post are annotated
with part-of-speech tags.</p>
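  <p>To make the n-gram counts mentioned for Chinese Web 5-gram concrete, here is a small sketch that counts unigrams through five-grams over an already segmented token list. It is a generic illustration, not the procedure used to build the corpus; the function name and the sample tokens are invented.</p>

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # An empty range results when the token list is shorter than n.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Invented, pre-segmented sample sentence fragments.
tokens = ["我", "喜欢", "语言", "我", "喜欢", "数据"]
counts = count_ngrams(tokens)
bigram_count = counts[("我", "喜欢")]
total_ngrams = sum(counts.values())
```

  <p>Each key is a tuple of tokens, so n-grams of different orders can share one table without colliding.</p>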

  <p class="MsoNormal" style=""><i>WTIMIT</i>
~ is a mobile wideband (i.e., 50 Hz – 7 kHz) telephone adjunct to TIMIT
(LDC93S1). WTIMIT was derived as follows: the original TIMIT speech
files at 16 kHz sampling rate were concatenated into 11 signal chunks,
each preceded by a 4-second calibration tone. These speech chunks were
transmitted via two prepared Nokia 6220 mobile phones over T-Mobile’s 3G
wideband mobile network in The Hague, the Netherlands, employing the
Adaptive Multirate Wideband (AMR-WB) speech codec. After data
acquisition, the chunks were deconcatenated by maximizing the normalized
cross-correlation with the original speech files, yielding a database
that is time aligned with the original TIMIT data with good precision.
Accordingly, all TIMIT label files can still be used. WTIMIT is suitable
for research on speech quality and intelligibility, and for
investigations of possible wideband upgrades of network-side IVR systems
with retrained or bandwidth-extended acoustic models for automatic
speech recognition. WTIMIT will be presented at LREC2010.</p>
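  <p>The alignment idea mentioned for WTIMIT, locating an original utterance inside a longer transmitted signal by maximizing normalized cross-correlation, can be sketched as follows. This is a toy, pure-Python version using the uncentered form of the measure on short lists; the function names and sample values are invented, and the actual WTIMIT processing may differ in its details.</p>

```python
import math


def normalized_xcorr(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # Treat an all-zero (silent) window as uncorrelated.
    return dot / norm if norm else 0.0


def best_offset(reference, recording):
    """Return the offset in `recording` where `reference` matches best.

    Raises ValueError if `recording` is shorter than `reference`.
    """
    n = len(reference)
    offsets = range(len(recording) - n + 1)
    return max(
        offsets,
        key=lambda k: normalized_xcorr(reference, recording[k:k + n]),
    )


# Invented signals: the recording holds a louder copy of the reference
# starting at index 3, surrounded by unrelated samples.
reference = [0.0, 1.0, -1.0, 0.5]
recording = [0.1, 0.0, 0.0, 0.0, 2.0, -2.0, 1.0, 0.2]
offset = best_offset(reference, recording)
```

  <p>Because the measure is normalized, a change in overall level between the original and the transmitted signal does not move the location of the maximum.</p>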

</blockquote>

<p class="MsoNormal" style="margin-bottom: 12pt;">2010
Subscription Members are automatically sent all MY2010 data as it is
released. 2010 Standard Members are entitled to request 16 corpora for
free from MY2010. Non-members may license most data for research use
only.</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">[<a href="#top">

top </a>]

</p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="left"><br>

</p>

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>

Ilya

Ahtaridis</big></small></small></font><br>

<font face="Courier New, Courier, monospace"><small><small><big>Membership

Coordinator</big></small></small></font><br>

<br>

<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font><br>

</div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

<pre class="moz-signature" cols="72">

</pre>

</body>

</html>