[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Mar 24 15:55:23 UTC 2010


New publications:

- LDC2010T05: NPS Internet Chatroom Conversations, Release 1.0
- LDC2010S02: WTIMIT 1.0

------------------------------------------------------------------------

*New Publications*


(1) NPS Internet Chatroom Conversations, Release 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T05> 
consists of 10,567 English posts (45,068 tokens) gathered from 
age-specific chat rooms of various online chat services in October and 
November 2006. Each file is a text recording from one of these chat 
rooms for a short period on a particular day. Users should be aware that 
some of the conversations in this corpus feature subjects and language 
that some people may find offensive or objectionable, including 
discussions of a sexual nature. This corpus was developed by researchers 
at the Department of Computer Science, Naval Postgraduate School 
<http://www.nps.edu/>, Monterey, California.  NPS Internet Chatroom 
Conversations is one of the first text-based chat corpora tagged with 
lexical and discourse information. This corpus might be used to develop 
stochastic NLP applications that perform tasks such as conversation 
thread topic detection, author profiling, entity identification, and 
social network analysis.

Each post is annotated with a chat dialog-act tag, and individual tokens 
within each post are annotated with part-of-speech tags. 3,507 tokenized 
posts were automatically tagged using a part-of-speech tagger trained on 
the Penn Treebank <http://www.cis.upenn.edu/%7Etreebank/> corpora, 
combined with a regular expression that identified privacy-masked user 
names and emoticons. Similarly, simple regular expression matching was 
employed to assign an initial chat dialog-act tag to each post in this 
subset. This initial tagging was verified by hand.  The remaining 7,060 
posts were POS-tagged using a POS tagger that was trained on the newly 
hand-tagged chat data and the Penn Treebank corpora. Dialog-act tagging 
on the remaining posts was accomplished using a back-propagation neural 
network trained on 21 features of the initial dialog-act-labeled posts. 
The tagging of this second group of posts was also manually verified.  
Ultimately, all 10,567 privacy-masked posts, representing 45,068 
tokens, were annotated with manually verified part-of-speech and 
dialog-act information.
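
The regular-expression step described above can be illustrated with a 
minimal sketch. The announcement does not give the actual expressions, 
the user-name format, or the tags assigned, so the two patterns, the 
example user name, and the NNP/UH tags below are assumptions made for 
illustration only.

```python
import re

# Hypothetical patterns: the announcement does not publish the real ones.
# Assumed masked user-name shape: "10-19-20sUser7" (date, room, user number).
USER_RE = re.compile(r"^\d{2}-\d{2}-\w+User\d+$")
# Assumed emoticon shape: eyes, optional nose, one or more mouth characters.
EMOTICON_RE = re.compile(r"^[:;=8][\-o']?[)(\]\[dDpP/\\|]+$")


def pretag(tokens):
    """Return (token, tag) pairs.

    Tokens matched by a pattern get a fixed tag (assumed here: NNP for
    masked user names, UH for emoticons). All other tokens get None,
    meaning they are left to the statistical part-of-speech tagger.
    """
    tagged = []
    for tok in tokens:
        if USER_RE.match(tok):
            tagged.append((tok, "NNP"))
        elif EMOTICON_RE.match(tok):
            tagged.append((tok, "UH"))
        else:
            tagged.append((tok, None))
    return tagged
```

In such a scheme the pattern-based tags would override the tagger's 
output for the matched tokens, and the result would still be checked by 
hand, as the announcement describes.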


(2)  WTIMIT 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S02> 
is a wideband mobile telephony derivative of TIMIT Acoustic-Phonetic 
Continuous Speech Corpus (TIMIT, LDC93S1) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1>. 
TIMIT contains wideband speech recordings (i.e., sampled at 16 kHz) of 
630 speakers of American English from eight major dialect regions, 
each reading ten phonetically rich sentences. While some derivative 
TIMIT corpora consist of wideband speech, others are telephony corpora 
representing narrowband speech, i.e., sampled at 8 kHz and containing 
frequency components from about 300 Hz to 3.4 kHz. Until now, no 
real-world wideband telephony speech corpus has been publicly available. 
With the arrival of wideband speech codecs such as G.722, G.722.1, 
G.722.2, and G.711.1, wideband telephony speech transmission is already 
feasible in an increasing number of mobile networks. Hence, a wideband telephone 
bandwidth adjunct to TIMIT is desirable for a wide range of scientific 
investigations, as well as development and evaluation of systems, e.g., 
Interactive Voice Response (IVR) systems. WTIMIT 1.0 (Wideband Mobile 
TIMIT) contains the recordings of the original TIMIT speech files after 
transmission over a real 3G AMR-WB mobile network.

WTIMIT 1.0 is organized according to the original TIMIT corpus. The 
training subset consists of 4620 speech files, while the test subset 
contains 1680 speech files. The speech format of the WTIMIT corpus is 
raw (i.e., no header information).  The recordings are in 1-channel 
linear PCM format at a 16 kHz sampling frequency with 16-bit resolution.
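
Because the files carry no header, a reader has to supply the format 
parameters itself. The following is a minimal sketch of decoding such a 
file using only the Python standard library. The announcement does not 
state the byte order, so little-endian is an assumption here, and the 
function names are hypothetical.

```python
import array
import sys

SAMPLE_RATE = 16000  # Hz, as stated in the corpus description


def decode_raw_pcm(data, little_endian=True):
    """Decode headerless 16-bit mono linear PCM bytes into signed samples.

    little_endian is an assumption; the announcement does not give the
    byte order of the WTIMIT files.
    """
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(data)     # raises ValueError on an odd byte count
    if little_endian != (sys.byteorder == "little"):
        samples.byteswap()      # convert to the machine's native order
    return samples


def duration_seconds(samples):
    """Length of a decoded utterance in seconds at 16 kHz."""
    return len(samples) / SAMPLE_RATE
```

A file would be decoded with something like 
decode_raw_pcm(open(path, "rb").read()), where path is a placeholder 
for one of the corpus files.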

Data preparation was conducted by converting the original TIMIT speech 
files into raw data and concatenating them into 11 signal chunks of at 
most 30 minutes' duration. In order to allow precise de-concatenation 
after transmission, and in order to be able to examine codec influence 
and channel distortion, each signal chunk is preceded by a 4 s 
calibration tone. It comprises 2 s of a 1 kHz sine wave followed by 
another 2 s of a linear sweep from 0 to 8 kHz. The prepared speech 
chunks were stored on a laptop PC and then transmitted over 
T-Mobile's AMR-WB-capable 3G mobile network in The Hague, The Netherlands.
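
The calibration tone described above can be sketched as follows. The 
durations and frequencies come from the description; the 16 kHz 
generation rate (taken from the TIMIT source files), the amplitude, and 
the function name are assumptions, and this is not the authors' actual 
generation code.

```python
import math


def calibration_tone(rate=16000, amplitude=0.5):
    """Return 4 s of samples: 2 s of 1 kHz sine, then a 2 s linear sweep.

    rate and amplitude are illustrative assumptions; the announcement
    specifies only the durations and frequencies.
    """
    n = 2 * rate  # number of samples in each 2 s segment

    # Segment 1: steady 1 kHz sine wave.
    sine = [amplitude * math.sin(2 * math.pi * 1000 * i / rate)
            for i in range(n)]

    # Segment 2: linear sweep from f0 to f1 over sweep_len seconds.
    # The instantaneous frequency is f0 + (f1 - f0) * t / sweep_len,
    # so the phase is its integral over time.
    f0, f1, sweep_len = 0.0, 8000.0, 2.0
    sweep = []
    for i in range(n):
        t = i / rate
        phase = 2 * math.pi * (f0 * t + (f1 - f0) * t * t / (2 * sweep_len))
        sweep.append(amplitude * math.sin(phase))

    return sine + sweep
```

A known tone at the start of each chunk gives a fixed reference for 
locating the chunk after transmission, and comparing the received sweep 
with the original shows how the codec and channel treat each frequency.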

The transmitted speech chunks were decimated from 48 kHz to 16 kHz 
sampling rate using a high-quality lowpass filter. Finally, they were 
de-concatenated by maximizing the cross-correlation between them and the 
original speech files.  Utterances in WTIMIT 1.0 can be considered 
time-aligned with those of TIMIT to an average precision of 0.0625 ms 
(one sample at 16 kHz). TIMIT's original label files (*.TXT, *.WRD, 
*.PHN) are valid for WTIMIT as well. However, the channel was found to 
frequently introduce misalignments of about 10 to 20 ms, mainly during 
speech pauses. Parts of the affected speech files are therefore slightly 
misaligned against the original label information.
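
The de-concatenation step can be illustrated with a brute-force sketch 
that finds the offset at which an original utterance best matches the 
received chunk. It uses a plain, unnormalized cross-correlation over 
every candidate offset, which is an assumption; the announcement does 
not say how the search was implemented or normalized. The function 
names are hypothetical.

```python
SAMPLE_RATE = 16000  # Hz, sampling rate of the WTIMIT files


def best_offset(chunk, reference):
    """Return the sample offset in chunk that maximizes the
    cross-correlation with reference (exhaustive search)."""
    best_lag, best_score = 0, float("-inf")
    width = len(reference)
    for lag in range(len(chunk) - width + 1):
        window = chunk[lag:lag + width]
        score = sum(c * r for c, r in zip(window, reference))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def samples_to_ms(n_samples, rate=SAMPLE_RATE):
    """Convert a number of samples to milliseconds."""
    return 1000.0 * n_samples / rate
```

At 16 kHz, samples_to_ms(1) gives 0.0625, which is the one-sample 
precision quoted above, and a 10 to 20 ms misalignment corresponds to 
160 to 320 samples. For full-length utterances an FFT-based correlation 
would be far faster than this exhaustive loop.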


------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora