<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center"><i>New publications: </i><br>

</div>

<br>

<div align="center">LDC2010T05<br>

<b>- </b><b><a href="#nps">NPS Internet Chatroom Conversations,
Release 1.0</a> -</b><br>

</div>

<p align="center">LDC2010S02<b><br>

- </b><b><a href="#wtimit">WTIMIT</a>

-</b></p>

<hr size="2" width="100%">

<p class="MsoNormal" style="text-align: center;" align="center"><b>New

Publications</b></p>


<p><a name="nps"></a>(1) <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T05">NPS
Internet Chatroom Conversations, Release 1.0</a> consists of 10,567

English

posts (45,068 tokens) gathered from age-specific chat rooms of various

online

chat services in October and November 2006. Each file is a text

recording from

one of these chat rooms for a short period on a particular day. Users

should be

aware that some of the conversations in this corpus feature subjects

and language

that some people may find offensive or objectionable, including

discussions of

a sexual nature. This corpus was developed by researchers at the

Department of

Computer Science, <a href="http://www.nps.edu/">Naval Postgraduate
School</a>, Monterey, California.

NPS

Internet Chatroom Conversations is one of the first text-based chat

corpora

tagged with lexical and discourse information. This corpus might be

used to

develop stochastic NLP applications that perform tasks such as

conversation

thread topic detection, author profiling, entity identification, and

social

network analysis.</p>

<p>Each post is annotated with a chat dialog-act tag, and individual
tokens within each post are annotated with part-of-speech (POS) tags.
An initial subset of 3,507 tokenized posts was tagged automatically
using a POS tagger trained on the <a
 href="http://www.cis.upenn.edu/%7Etreebank/">Penn Treebank</a>
corpora, combined with a regular expression that identified
privacy-masked user names and emoticons. Similarly, simple regular
expression matching was used to assign an initial chat dialog-act tag
to each post in this subset. This initial tagging was verified by hand.
The remaining 7,060 posts were POS-tagged using a tagger trained on
the newly hand-tagged chat data and the Penn Treebank corpora.
Dialog-act tagging of the remaining posts was accomplished using a
back-propagation neural network trained on 21 features of the initial
dialog-act-labeled posts. The tagging of this second group of posts was
also manually verified. Ultimately, all 10,567 privacy-masked posts,
representing 45,068 tokens, were annotated with manually verified
part-of-speech and dialog-act information.</p>
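<p>As a rough illustration of the regular-expression step described
above, the following Python sketch pre-labels tokens that look like
masked user names or emoticons. The user-name shape, the emoticon
pattern, and the label names are all assumptions made for this example;
the announcement does not specify the patterns or tags the corpus
developers actually used.</p>
<pre>
```python
import re

# Assumed shape of a privacy-masked user name, e.g. "10-19-20sUser7".
# The announcement does not specify the masking format.
USER_RE = re.compile(r"^\d{2}-\d{2}-\w+User\d+$")
# A deliberately small, illustrative emoticon pattern such as :) ;-) :D
EMOTICON_RE = re.compile(r"^[:;=][-']?[()DPp]$")


def pre_tag(tokens):
    """Return (token, label) pairs; label is None when no pattern matches."""
    tagged = []
    for token in tokens:
        if USER_RE.match(token):
            tagged.append((token, "USER"))      # hypothetical label
        elif EMOTICON_RE.match(token):
            tagged.append((token, "EMOTICON"))  # hypothetical label
        else:
            tagged.append((token, None))        # left for the POS tagger
    return tagged
```
</pre>
<p>Tokens that receive no label here would be passed on to a
statistical tagger, and every automatic decision would still need the
manual verification described above.</p>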

<br>

<p>[<a href="#top">top</a>]</p>
<p style="text-align: center;" align="center">*</p>

<p><a name="wtimit"></a>(2) <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S02">WTIMIT
1.0</a> is a wideband mobile telephony derivative of <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">TIMIT
Acoustic-Phonetic Continuous Speech Corpus (TIMIT, LDC93S1)</a>. TIMIT
contains wideband speech recordings (i.e., sampled at 16 kHz) of 630
speakers of American English from eight major dialect regions, each
reading ten phonetically rich sentences. While some derivative TIMIT
corpora consist of wideband speech, others are telephony corpora
representing narrowband speech, i.e., sampled at 8 kHz and containing
frequency components from about 300 Hz to 3.4 kHz. Until now, no
real-world wideband telephony speech corpus has been publicly
available. With the arrival of wideband speech codecs such as G.722,
G.722.1, G.722.2, and G.711.1, wideband telephony speech transmission
is already feasible in an increasing number of mobile networks. Hence,
a wideband telephone bandwidth adjunct to TIMIT is desirable for a wide
range of scientific investigations, as well as for the development and
evaluation of systems such as Interactive Voice Response (IVR) systems.
WTIMIT 1.0 (Wideband Mobile TIMIT) contains recordings of the original
TIMIT speech files after transmission over a real 3G AMR-WB mobile
network.</p>

<p>WTIMIT 1.0 is organized in the same way as the original TIMIT
corpus. The training subset consists of 4,620 speech files, while the
test subset contains 1,680 speech files. The speech files are in raw
format (i.e., with no header information): 1-channel linear PCM at a
16 kHz sampling frequency and 16-bit resolution.</p>
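<p>Because the files carry no header, the reader has to supply the
format when loading them. The Python sketch below shows one way to do
this using only the standard library. The function name is
hypothetical, and signed samples in little-endian byte order are
assumptions; the announcement states only the sampling rate, channel
count, and bit depth.</p>
<pre>
```python
import array
import sys

SAMPLE_RATE_HZ = 16000  # stated in the corpus description


def read_raw_pcm(path, little_endian=True):
    """Read headerless 16-bit mono PCM and return (samples, duration in s).

    Signed samples and little-endian byte order are assumptions here.
    """
    samples = array.array("h")  # signed 16-bit integers
    with open(path, "rb") as handle:
        samples.frombytes(handle.read())
    # Swap bytes when the file's byte order differs from the machine's.
    if little_endian != (sys.byteorder == "little"):
        samples.byteswap()
    return samples, len(samples) / SAMPLE_RATE_HZ
```
</pre>
<p>If the decoded audio sounds like noise, the byte-order assumption is
the first thing to check, for example by calling the function with
little_endian=False.</p>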

<p>Data preparation was conducted by converting the original TIMIT
speech files into raw data and concatenating them into 11 signal chunks
of at most 30 minutes each. To allow precise de-concatenation after
transmission, and to make it possible to examine codec influence and
channel distortion, each signal chunk is preceded by a 4 s calibration
tone, which comprises 2 s of a 1 kHz sine wave followed by 2 s of a
linear sweep from 0 to 8 kHz. The prepared speech chunks were stored on
a laptop PC and then transmitted over T-Mobile's AMR-WB-capable 3G
mobile network in The Hague, The Netherlands.</p>
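<p>A calibration tone of the kind described above can be sketched in a
few lines of Python. The 16 kHz sampling rate, the amplitude, and the
function name are assumptions chosen for illustration; the announcement
specifies only the durations and frequencies.</p>
<pre>
```python
import math


def calibration_tone(sample_rate=16000, amplitude=0.5):
    """Return 4 s of samples: 2 s of 1 kHz sine, then a 2 s 0-8 kHz sweep."""
    seg_s = 2.0
    n = int(seg_s * sample_rate)
    sine = [amplitude * math.sin(2 * math.pi * 1000.0 * i / sample_rate)
            for i in range(n)]
    # Linear sweep: instantaneous frequency f(t) = f0 + (f1 - f0) * t / T,
    # so the phase is 2*pi * (f0*t + (f1 - f0) * t**2 / (2*T)).
    f0, f1 = 0.0, 8000.0
    sweep = []
    for i in range(n):
        t = i / sample_rate
        phase = 2 * math.pi * (f0 * t + (f1 - f0) * t * t / (2 * seg_s))
        sweep.append(amplitude * math.sin(phase))
    return sine + sweep
```
</pre>
<p>The steady sine gives a fixed reference level and frequency, while
the sweep exercises the whole 0 to 8 kHz band, which is what makes such
a tone useful for inspecting codec and channel behavior.</p>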

<p>The transmitted speech chunks were decimated from a 48 kHz to a
16 kHz sampling rate using a high-quality lowpass filter. Finally, they
were de-concatenated by maximizing the cross-correlation between them
and the original speech files. Utterances in WTIMIT 1.0 can be
considered time-aligned with those of TIMIT to an average precision of
0.0625 ms (one sample). TIMIT's original label files (*.TXT, *.WRD,
*.PHN) are valid for WTIMIT as well. However, the channel was found to
frequently introduce misalignments of about 10 to 20 ms, mainly during
speech pauses. Parts of the affected speech files are therefore
slightly misaligned with the original label information.</p>
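<p>The idea of aligning by maximizing cross-correlation can be
illustrated with a small Python sketch. It uses a plain, unnormalized
correlation over a toy signal; the function name, the search range, and
the sample values are assumptions for this example and do not
reproduce the procedure actually used to build the corpus.</p>
<pre>
```python
def best_lag(received, reference, max_lag):
    """Return the lag in 0..max_lag that maximizes the cross-correlation."""
    def correlation(lag):
        return sum(r * received[lag + i] for i, r in enumerate(reference))
    # Never let the reference run past the end of the received signal.
    limit = min(max_lag, len(received) - len(reference))
    return max(range(limit + 1), key=correlation)


SAMPLE_RATE_HZ = 16000
reference = [0.0, 1.0, -1.0, 0.5]
received = [0.0] * 5 + reference + [0.0] * 3  # reference delayed by 5 samples
lag = best_lag(received, reference, max_lag=8)
print(lag, "samples =", 1000.0 * lag / SAMPLE_RATE_HZ, "ms")
```
</pre>
<p>At 16 kHz one sample lasts 1/16000 s, or 0.0625 ms, which is where
the stated alignment precision comes from.</p>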

<br>

<p class="MsoNormal">[<a href="#top">

top </a>]</p>

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>

Ilya

Ahtaridis</big></small></small></font><br>

<font face="Courier New, Courier, monospace"><small><small><big>Membership

Coordinator</big></small></small></font><br>

<br>

<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font>

<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

<br>

</body>

</html>