<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center"><i>New publications: </i><br>
</div>
<br>
<div align="center">LDC2010T05<br>
<b>- </b><b><a href="#nps">NPS
Internet Chatroom Conversations,
Release 1.0</a> -</b><br>
</div>
<p align="center">LDC2010S02<b><br>
- </b><b><a href="#wtimit">WTIMIT</a>
-</b></p>
<hr size="2" width="100%">
<p class="MsoNormal" style="text-align: center;" align="center"><b>New
Publications</b></p>
<p><a name="nps"></a>(1)<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T05">
NPS
Internet Chatroom Conversations, Release 1.0</a> consists of 10,567
English
posts (45,068 tokens) gathered from age-specific chat rooms of various
online
chat services in October and November 2006. Each file is a text
recording from
one of these chat rooms for a short period on a particular day. Users
should be
aware that some of the conversations in this corpus feature subjects
and language
that some people may find offensive or objectionable, including
discussions of
a sexual nature. This corpus was developed by researchers at the
Department of
Computer Science, <a href="http://www.nps.edu/">Naval Postgraduate
School</a>, Monterey, California.
NPS
Internet Chatroom Conversations is one of the first text-based chat
corpora
tagged with lexical and discourse information. This corpus might be
used to
develop stochastic NLP applications that perform tasks such as
conversation
thread topic detection, author profiling, entity identification, and
social
network analysis.</p>
<p>Each post is annotated with a chat dialog-act tag, and individual
tokens
within each post are annotated with part-of-speech tags. 3,507
tokenized posts
were automatically tagged using a part-of-speech tagger trained on the <a
href="http://www.cis.upenn.edu/%7Etreebank/">Penn Treebank</a>
corpora,
combined with a regular expression that identified privacy-masked user
names
and emoticons. Similarly, simple regular expression matching was
employed to
assign an initial chat dialog-act tag to each post in this subset. This
initial
tagging was verified by hand. The
remaining 7,060 posts were POS-tagged using a POS tagger that was
trained on
the newly hand-tagged chat data and the Penn Treebank corpora.
Dialog-act
tagging on the remaining posts was accomplished using a
back-propagation neural
network trained on 21 features of the initial dialog-act-labeled posts.
The
tagging of this second group of posts was also manually verified.
Ultimately, all 10,567 privacy-masked
posts, representing 45,068 tokens, were annotated with manually
verified
part-of-speech and dialog-act information.</p>
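<p>As a rough illustration of the regular-expression pass described above,
the sketch below assigns a provisional dialog-act tag to a post. The
patterns, the tag names, and the function name <code>initial_dialog_act</code>
are all hypothetical and chosen only for illustration; they are not the
rules or the tag set used to build the corpus, where the initial tags were
in any case verified by hand.</p>

```python
import re

# Hypothetical (pattern, tag) rules, tried in order; the first match wins.
# Neither the patterns nor the tag names are taken from the corpus itself.
RULES = [
    (re.compile(r"^(hi|hello|hey|hiya)\b", re.IGNORECASE), "Greet"),
    (re.compile(r"^(bye|cya|goodbye)\b", re.IGNORECASE), "Bye"),
    (re.compile(r"^(who|what|where|when|why|how)\b.*\?$", re.IGNORECASE), "whQuestion"),
    (re.compile(r"\?$"), "ynQuestion"),
]
DEFAULT_TAG = "Statement"  # fallback when no rule fires


def initial_dialog_act(post):
    """Return a provisional dialog-act tag for one chat post."""
    text = post.strip()
    for pattern, tag in RULES:
        if pattern.search(text):
            return tag
    return DEFAULT_TAG
```

<p>A first pass of this kind is only meant to reduce manual effort: every
provisional tag would still need to be checked by an annotator.</p>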
<br>
<p>[<a href="#top">
top </a>]<br>
</p>
<p style="text-align: center;" align="center">*</p>
<p><a name="wtimit"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S02">WTIMIT
1.0</a> is a wideband mobile telephony derivative of <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">TIMIT
Acoustic-Phonetic Continuous Speech Corpus (TIMIT, LDC93S1)</a>. TIMIT
contains
wideband speech recordings (i.e., sampled at 16 kHz) of 630 speakers of
American English from eight major dialect regions, each reading ten
phonetically rich sentences. While some derivative
TIMIT corpora consist of wideband
speech, others are telephony corpora representing narrowband speech,
i.e.,
sampled at 8 kHz and containing frequency components from about 300 Hz
to 3.4
kHz. Until now, no real-world wideband telephony speech corpus has been
publicly available. With the arrival of wideband speech codecs such as
G.722,
G.722.1, G.722.2, and G.711.1, wideband telephony speech transmission
is
already feasible in an increasing number of mobile networks. Hence, a
wideband
telephone bandwidth adjunct to TIMIT is desirable for a wide range of
scientific investigations, as well as development and evaluation of
systems,
e.g., Interactive Voice Response (IVR) systems. WTIMIT 1.0 (Wideband
Mobile TIMIT)
contains the recordings of the original TIMIT speech files after
transmission
over a real 3G AMR-WB mobile network.</p>
<p>WTIMIT 1.0 follows the organization of the original TIMIT corpus. The
training subset consists of 4,620 speech files, while the test subset
contains 1,680 speech files. The speech files of the WTIMIT corpus are
stored as raw data (i.e., with no header information). The recordings
are in 1-channel linear PCM format at a 16 kHz sampling frequency and
16-bit resolution.</p>
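<p>Because the files carry no header, a reader has to supply the format
itself. The following is a minimal sketch of loading one such file using
only the Python standard library. The sampling rate and sample width come
from the description above; the byte order is not stated there, so
little-endian is an assumption here, and the function names are
hypothetical.</p>

```python
import array
import sys

SAMPLE_RATE_HZ = 16_000  # from the corpus description
BYTES_PER_SAMPLE = 2     # 16-bit linear PCM, one channel


def read_raw_pcm(path, little_endian=True):
    """Read a headerless 16-bit mono PCM file into a list of ints.

    little_endian=True is an assumption; the byte order is not
    documented in the announcement.
    """
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % BYTES_PER_SAMPLE:
        raise ValueError("file size is not a whole number of 16-bit samples")
    samples = array.array("h")  # signed 16-bit on common platforms
    samples.frombytes(data)
    # array decodes in native byte order; swap if the file's order differs
    if (sys.byteorder == "little") != little_endian:
        samples.byteswap()
    return samples.tolist()


def duration_seconds(samples):
    """Duration implied by the sample count at 16 kHz."""
    return len(samples) / SAMPLE_RATE_HZ
```

<p>If the decoded signal sounds like noise, the byte-order assumption is
the first thing to check.</p>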
<p>Data preparation was conducted by converting the original TIMIT
speech files into raw data and concatenating them into 11 signal chunks
of at most 30 minutes each. To allow precise de-concatenation after
transmission, and to make it possible to examine codec influence and
channel distortion, each signal chunk is preceded by a 4 s calibration
tone. The tone comprises 2 s of a 1 kHz sine wave followed by 2 s of a
linear sweep from 0 to 8 kHz. The prepared speech chunks were stored on
a laptop PC and then transmitted over T-Mobile's AMR-WB-capable 3G
mobile network in The Hague, The Netherlands.</p>
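<p>A calibration tone with the structure described above could be
synthesized roughly as follows. This is a sketch, not the procedure used
for the corpus: the 16 kHz generation rate, the amplitude, and the
function name <code>calibration_tone</code> are assumptions. The sweep
uses the standard linear-chirp phase, in which the instantaneous
frequency rises linearly from 0 to 8 kHz over 2 s.</p>

```python
import math


def calibration_tone(sample_rate=16_000, amplitude=0.5):
    """Return 2 s of a 1 kHz sine followed by a 2 s linear sweep 0-8 kHz.

    sample_rate and amplitude are illustrative assumptions.
    """
    seconds = 2.0
    n = int(seconds * sample_rate)

    # Part 1: steady 1 kHz sine wave
    sine = [amplitude * math.sin(2 * math.pi * 1000.0 * i / sample_rate)
            for i in range(n)]

    # Part 2: linear chirp; phase(t) = 2*pi*(f0*t + (f1 - f0) * t^2 / (2*T))
    f0, f1 = 0.0, 8000.0
    sweep = []
    for i in range(n):
        t = i / sample_rate
        phase = 2 * math.pi * (f0 * t + (f1 - f0) * t * t / (2 * seconds))
        sweep.append(amplitude * math.sin(phase))

    return sine + sweep
```

<p>At a 16 kHz rate the sweep ends exactly at the Nyquist frequency, so
its final portion probes the upper edge of the wideband channel.</p>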
<p>The transmitted speech chunks were decimated from a 48 kHz to a 16 kHz
sampling rate using a high-quality lowpass filter. Finally, they were
de-concatenated by maximizing the cross-correlation between them and the
original speech files. Utterances in WTIMIT 1.0 can be considered
time-aligned with those of TIMIT to an average precision of 0.0625 ms
(one sample at 16 kHz). TIMIT's original label files (*.TXT, *.WRD,
*.PHN) are valid for WTIMIT as well. However, the channel was found to
frequently introduce misalignments of about 10 to 20 ms, mainly during
speech pauses. Parts of the affected speech files are therefore slightly
misaligned against the original label information.</p>
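<p>The idea of locating an utterance by maximizing cross-correlation can
be sketched as below. This is a brute-force, unnormalized version written
for clarity, with a hypothetical function name and search window; the
actual alignment procedure used for WTIMIT is not described in more
detail here. The last lines reproduce the one-sample precision figure.</p>

```python
def best_lag(reference, received, max_lag):
    """Return the offset into `received` where `reference` correlates best.

    Plain (unnormalized) cross-correlation over lags 0..max_lag.
    """
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        segment = received[lag:lag + len(reference)]
        if len(segment) < len(reference):
            break  # ran past the end of the received signal
        score = sum(r * s for r, s in zip(reference, segment))
        if score > best_score:
            best, best_score = lag, score
    return best


# One sample at 16 kHz lasts 1000 / 16000 = 0.0625 ms
SAMPLE_RATE_HZ = 16_000
sample_period_ms = 1000 / SAMPLE_RATE_HZ
```

<p>For chunks of up to 30 minutes, an FFT-based correlation would be far
faster than this direct loop, and normalizing the score would make it
robust to level changes introduced by the channel.</p>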
<br>
<p class="MsoNormal">[<a href="#top">
top </a>]</p>
<hr size="2" width="100%">
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>
Ilya
Ahtaridis</big></small></small></font><br>
<font face="Courier New, Courier, monospace"><small><small><big>Membership
Coordinator</big></small></small></font><br>
<br>
<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font>
<br>
</div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
<br>
</body>
</html>