[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Mar 24 15:55:23 UTC 2010
New publications:
LDC2010T05 - NPS Internet Chatroom Conversations, Release 1.0
LDC2010S02 - WTIMIT
------------------------------------------------------------------------
*New Publications*
(1) NPS Internet Chatroom Conversations, Release 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T05>
consists of 10,567 English posts (45,068 tokens) gathered from
age-specific chat rooms of various online chat services in October and
November 2006. Each file is a text recording from one of these chat
rooms for a short period on a particular day. Users should be aware that
some of the conversations in this corpus feature subjects and language
that some people may find offensive or objectionable, including
discussions of a sexual nature. This corpus was developed by researchers
at the Department of Computer Science, Naval Postgraduate School
<http://www.nps.edu/>, Monterey, California. NPS Internet Chatroom
Conversations is one of the first text-based chat corpora tagged with
lexical and discourse information. This corpus might be used to develop
stochastic NLP applications that perform tasks such as conversation
thread topic detection, author profiling, entity identification, and
social network analysis.
Each post is annotated with a chat dialog-act tag, and individual tokens
within each post are annotated with part-of-speech tags. 3,507 tokenized
posts were automatically tagged using a part-of-speech tagger trained on
the Penn Treebank <http://www.cis.upenn.edu/%7Etreebank/> corpora,
combined with a regular expression that identified privacy-masked user
names and emoticons. Similarly, simple regular expression matching was
employed to assign an initial chat dialog-act tag to each post in this
subset. This initial tagging was verified by hand. The remaining 7,060
posts were POS-tagged using a POS tagger that was trained on the newly
hand-tagged chat data and the Penn Treebank corpora. Dialog-act tagging
on the remaining posts was accomplished using a back-propagation neural
network trained on 21 features of the initial dialog-act-labeled posts.
The tagging of this second group of posts was also manually verified.
In total, all 10,567 privacy-masked posts, representing 45,068 tokens,
were annotated with manually verified part-of-speech and dialog-act
information.
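As a rough illustration of the regular-expression step described above, the
following sketch pre-tags privacy-masked user names and emoticons before a
regular POS tagger handles the remaining tokens. It is not the NPS
researchers' code. The masked-name shape (e.g. "10-19-20sUser7"), the
emoticon pattern, the tag choices ("NNP", "UH"), and the function name
pretag are all assumptions made for this example.

```python
import re

# ASSUMPTION: privacy-masked user names look like "10-19-20sUser7".
# The announcement does not specify the masking format.
MASKED_USER = re.compile(r"^\d+-\d+-\w+User\d+$")

# ASSUMPTION: a deliberately small emoticon pattern
# (eyes, optional nose, one or more mouth characters).
EMOTICON = re.compile(r"^[:;=8][-o^']?[)(\]\[DPpOo/\\|*]+$")


def pretag(tokens):
    """Return (token, tag) pairs.

    The tag is None wherever a statistical POS tagger would still be needed.
    "NNP" and "UH" are illustrative tag choices, not taken from the corpus
    documentation.
    """
    tagged = []
    for tok in tokens:
        if MASKED_USER.match(tok):
            tagged.append((tok, "NNP"))
        elif EMOTICON.match(tok):
            tagged.append((tok, "UH"))
        else:
            tagged.append((tok, None))
    return tagged
```

Tokens left with a None tag would then be passed to the trained tagger, and
the combined output checked by hand as the announcement describes.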
(2) WTIMIT 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S02>
is a wideband mobile telephony derivative of the TIMIT Acoustic-Phonetic
Continuous Speech Corpus (TIMIT, LDC93S1)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1>.
TIMIT contains wideband speech recordings (i.e., sampled at 16 kHz) of
630 speakers of American English from eight major dialect regions,
each reading ten phonetically rich sentences. While some derivative
TIMIT corpora consist of wideband speech, others are telephony corpora
representing narrowband speech, i.e., sampled at 8 kHz and containing
frequency components from about 300 Hz to 3.4 kHz. Until now, no
real-world wideband telephony speech corpus has been publicly available.
With the emergence of wideband speech codecs such as G.722, G.722.1, G.722.2,
and G.711.1, wideband telephony speech transmission is already feasible
in an increasing number of mobile networks. Hence, a wideband telephone
bandwidth adjunct to TIMIT is desirable for a wide range of scientific
investigations, as well as development and evaluation of systems, e.g.,
Interactive Voice Response (IVR) systems. WTIMIT 1.0 (Wideband Mobile
TIMIT) contains the recordings of the original TIMIT speech files after
transmission over a real 3G AMR-WB mobile network.
WTIMIT 1.0 is organized according to the original TIMIT corpus. The
training subset consists of 4620 speech files, while the test subset
contains 1680 speech files. The speech files are stored as raw data
(i.e., without header information), in 1-channel linear PCM format at
a 16 kHz sampling frequency and 16-bit resolution.
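Because the files carry no header, a reader has to supply the format
parameters itself. The sketch below wraps one raw file in a WAV container
using Python's standard wave module. The sampling rate, sample width, and
channel count come from the description above. The byte order is not stated
in this announcement, so little-endian is an assumption here, and the
function name raw_to_wav is hypothetical.

```python
import wave

SAMPLE_RATE_HZ = 16000   # stated above: 16 kHz sampling frequency
SAMPLE_WIDTH_BYTES = 2   # stated above: 16-bit resolution
CHANNELS = 1             # stated above: 1-channel


def raw_to_wav(raw_path, wav_path):
    """Copy headerless PCM samples into a WAV file; return duration in seconds.

    ASSUMPTION: samples are little-endian, which is what WAV expects.
    Big-endian data would need a byte swap before writing.
    """
    with open(raw_path, "rb") as f:
        pcm = f.read()
    if len(pcm) % SAMPLE_WIDTH_BYTES:
        raise ValueError("file size is not a whole number of 16-bit samples")
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH_BYTES)
        w.setframerate(SAMPLE_RATE_HZ)
        w.writeframes(pcm)
    return len(pcm) // SAMPLE_WIDTH_BYTES / SAMPLE_RATE_HZ
```

If the resulting audio sounds like noise, the byte-order assumption is the
first thing to check.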
Data preparation was conducted by converting the original TIMIT speech
files into raw data and concatenating them to 11 signal chunks of at
most 30 minutes duration. In order to allow precise de-concatenation
after transmission, and in order to be able to examine codec influence
and channel distortion, each signal chunk is preceded by a 4 s
calibration tone. It comprises 2 s of a 1 kHz sine wave followed by
another 2 s of a linear sweep from 0 to 8 kHz. The prepared speech
chunks were stored on a laptop PC and then transmitted over T-Mobile's
AMR-WB-capable 3G mobile network in The Hague, The Netherlands.
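For readers who want to reproduce a comparable calibration signal, here is a
minimal sketch of a 2 s, 1 kHz sine followed by a 2 s linear sweep from 0 to
8 kHz. Only the durations and frequencies come from the description above.
The 16 kHz generation rate, the amplitude of 0.5, the phase handling at the
junction, and the function name calibration_tone are assumptions, so the
output will not necessarily match the tone actually used for WTIMIT.

```python
import math


def calibration_tone(rate_hz=16000, amplitude=0.5):
    """Return a list of float samples: 2 s sine at 1 kHz, then 2 s linear sweep.

    ASSUMPTIONS: rate_hz and amplitude are illustrative defaults, and each
    segment starts at zero phase.
    """
    tone_s, sweep_s = 2.0, 2.0
    f_tone, f_start, f_end = 1000.0, 0.0, 8000.0

    samples = []
    for n in range(int(tone_s * rate_hz)):
        t = n / rate_hz
        samples.append(amplitude * math.sin(2 * math.pi * f_tone * t))

    # Linear sweep: instantaneous frequency f(t) = f_start + k * t,
    # so the phase is the integral 2*pi * (f_start * t + k * t^2 / 2).
    k = (f_end - f_start) / sweep_s
    for n in range(int(sweep_s * rate_hz)):
        t = n / rate_hz
        phase = 2 * math.pi * (f_start * t + 0.5 * k * t * t)
        samples.append(amplitude * math.sin(phase))
    return samples
```

At a 16 kHz rate the sweep ends exactly at the Nyquist frequency, so it
exercises the full wideband range of the channel.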
The transmitted speech chunks were decimated from 48 kHz to 16 kHz
sampling rate using a high-quality lowpass filter. Finally, they were
de-concatenated by maximizing the cross-correlation between them and the
original speech files. Utterances in WTIMIT 1.0 can be considered
time-aligned with those of TIMIT to an average precision of 0.0625 ms
(one sample). TIMIT's original label files (*.TXT, *.WRD, *.PHN) are
valid for WTIMIT as well. However, the channel was found to frequently
introduce misalignments of about 10 to 20 ms, mainly during speech
pauses. Parts of the affected speech files are therefore slightly
misaligned against the original label information.
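The idea of locating an utterance by maximizing cross-correlation can be
sketched as follows. This is a brute-force, unnormalized illustration in
plain Python, not the procedure used to build WTIMIT. The function name
best_lag and its max_lag parameter are hypothetical, and real data would
call for an FFT-based or normalized correlation.

```python
# One sample at 16 kHz lasts 1000 / 16000 = 0.0625 ms, which is the
# alignment precision quoted above.
SAMPLE_PERIOD_MS = 1000.0 / 16000


def best_lag(reference, received, max_lag):
    """Return the offset (in samples) at which `reference` best matches
    `received`, judged by the raw (unnormalized) cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        segment = received[lag:lag + len(reference)]
        if len(segment) < len(reference):
            break  # reference no longer fits inside the received signal
        score = sum(r * s for r, s in zip(reference, segment))
        if score > best_score:
            best, best_score = lag, score
    return best
```

Multiplying the returned lag by SAMPLE_PERIOD_MS converts it to
milliseconds, which makes it easy to compare against the 10 to 20 ms
misalignments mentioned above.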
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu