[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Mon Mar 1 16:21:02 UTC 2010
/New Publications:/
- LDC2010S01 - Fisher Spanish Speech
- LDC2010T04 - Fisher Spanish - Transcripts
/Other news:/
- 65,000th LDC Corpus Distributed!
- 2010 Publications Pipeline
------------------------------------------------------------------------
*New Publications*
(1) Fisher Spanish Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01>
was developed by LDC and consists of audio files covering roughly 163
hours of telephone speech from 136 native Caribbean Spanish and
non-Caribbean Spanish speakers. Full orthographic transcripts of these
audio files are available in Fisher Spanish - Transcripts (LDC2010T04)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04>.
The Fisher telephone conversation collection protocol was created at LDC
to address a critical need of developers trying to build robust
automatic speech recognition (ASR) systems. Under the Fisher protocol, a
very large number of participants each make a few calls of short
duration speaking to other participants, whom they typically do not
know, about assigned topics. This maximizes inter-speaker variation and
vocabulary breadth although it also increases formality. Previous
protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon
participant activity to drive the collection. Fisher is unique in being
platform driven rather than participant driven. Participants who wish to
initiate a call may do so; however, the collection platform initiates the
majority of calls. Participants need only answer their phones at the
times they specified when registering for the study.
To encourage a broad range of vocabulary, Fisher participants are asked
to speak on an assigned topic selected at random from a list. The topic
changes every 24 hours and is assigned to all subjects paired on that
day. Some topics are inherited or refined from previous
Switchboard studies while others were developed specifically for the
Fisher protocol.
In collecting data for this corpus, attempts were made to provide a
representative distribution of subjects across a variety of demographic
categories, including gender, age, dialect region, and education level.
Native speakers of Caribbean Spanish and non-Caribbean Spanish were
recruited from within the continental United States and Puerto Rico.
The speech recordings consist of 819 telephone conversations of 10 to 12
minutes in duration. They are provided as digital audio files in NIST
SPHERE format (1024-byte ASCII file headers). The conversations were
recorded as 2-channel mu-law sample data with 8000 samples per second
(as captured from the public telephone network).
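For readers who want to inspect the audio files programmatically, the
following is a minimal sketch of reading a SPHERE header, not LDC tooling.
It assumes the conventional SPHERE layout: a "NIST_1A" magic line, a
header-size line, then "name -type value" fields terminated by "end_head".
The demo header and its values are invented for illustration, and the
sketch assumes that sample_count is counted per channel.

```python
import io


def parse_sphere_header(stream):
    """Parse the ASCII header of a NIST SPHERE file from a binary stream."""
    magic = stream.readline().decode("ascii").strip()
    if magic != "NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # Second line gives the total header size in bytes (1024 for this corpus).
    header_size = int(stream.readline().decode("ascii").strip())
    stream.seek(0)
    raw = stream.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in raw.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip padding or malformed lines
        name, type_tag, value = parts
        if type_tag == "-i":
            fields[name] = int(value)
        elif type_tag == "-r":
            fields[name] = float(value)
        else:  # "-sN" marks a string of N characters
            fields[name] = value
    return header_size, fields


def duration_seconds(fields):
    """Duration, assuming sample_count is the number of samples per channel."""
    return fields["sample_count"] / fields["sample_rate"]


def make_demo_header():
    """Build an invented 1024-byte header resembling the format described."""
    lines = [
        "NIST_1A",
        "   1024",
        "sample_rate -i 8000",
        "channel_count -i 2",
        "sample_count -i 4800000",
        "sample_n_bytes -i 1",
        "sample_coding -s4 ulaw",
        "end_head",
    ]
    text = "\n".join(lines) + "\n"
    return text.encode("ascii").ljust(1024, b" ")


size, fields = parse_sphere_header(io.BytesIO(make_demo_header()))
print(size, fields["channel_count"], duration_seconds(fields))
```

With a real file, the same function could be called on a stream opened
with open(path, "rb"); the audio samples begin immediately after the
header bytes.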
(2) Fisher Spanish - Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04>
was developed by LDC and contains full orthographic transcripts of the
telephone speech in Fisher Spanish Speech (LDC2010S01)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01>.
Transcripts cover roughly 163 hours of telephone speech from 136 native
Caribbean Spanish and non-Caribbean Spanish speakers.
The transcript files are in plain-text, tab-delimited format (tdf) with
UTF-8 character encoding. They were created with the LDC-developed
transcription tool "XTrans" <http://www.ldc.upenn.edu/tools/XTrans/>,
which allowed for improved handling of multi-channel audio and
overlapping speakers. XTrans is available from LDC.
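As an illustration of working with tab-delimited, UTF-8 transcripts, here
is a small sketch using Python's standard csv module. The column names
(file, channel, start, end, speaker, transcript) and the sample rows are
hypothetical; the actual column layout is defined in the documentation
that accompanies the release.

```python
import csv
import io


def read_tdf(stream):
    """Yield one dict per data row of a tab-delimited transcript stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)  # first line is assumed to name the columns
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield dict(zip(header, row))


def overlaps(a, b):
    """True if two segments overlap in time (start/end given in seconds)."""
    return float(a["start"]) < float(b["end"]) and float(b["start"]) < float(a["end"])


# Hypothetical sample: two segments on different channels that overlap.
SAMPLE = (
    "file\tchannel\tstart\tend\tspeaker\ttranscript\n"
    "demo.sph\t0\t0.50\t2.75\tA\thola, ¿cómo estás?\n"
    "demo.sph\t1\t2.10\t3.40\tB\tbien, gracias\n"
)

rows = list(read_tdf(io.StringIO(SAMPLE)))
print(len(rows), overlaps(rows[0], rows[1]))
```

For a file on disk, opening it with open(path, encoding="utf-8",
newline="") and passing the handle to read_tdf should behave the same way.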
Transcribers followed LDC's Transcription Guidelines (NQTR),
which are included with the documentation for this release.
Fisher Spanish Speech (LDC2010S01)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01>
provides the digital audio used as the basis for the transcriptions in
this corpus, in the form of 2-channel mu-law sample data with 8000
samples per second (as captured from the public telephone network), for
819 telephone conversations of 10 to 12 minutes in duration. The audio
files are in NIST SPHERE format (1024-byte ASCII file headers).
*65,000th LDC Corpus Distributed!*
LDC has recently reached another milestone. Two years after having
distributed our 50,000th corpus, we have just distributed our 65,000th!
To help us celebrate, we took the names of all the organizations that
had licensed data on the day we distributed our 65,000th corpus and
tossed them into a Phillies baseball cap.
We then randomly drew a name, and the winner is ... Swarthmore College
and Universidad Carlos III de Madrid! That's not a typo; we have two
lucky winners! We are celebrating our 65,000th distribution by awarding
a benefit of US$2000 each to both Swarthmore College and Universidad
Carlos III de Madrid. The benefit can be used towards membership or data
licensing fees at any time this year.
Swarthmore College and Universidad Carlos III de Madrid join our other
recipients of landmark corpora distributions:
* Helsinki University of Technology, Adaptive Informatics
Research Centre (AIRC) - licensed our 50,000th distribution in
January 2008.
* Instituto de Engenharia de Sistemas e Computadores (INESC) -
licensed our 40,000th distribution in November 2006.
* University of Hawai'i, Manoa, Language Analysis and
Experimentation Laboratories - licensed our 15,000th distribution
in April 2002.
We would like to thank both members and non-members for helping the LDC
reach this landmark distribution. The unceasing demand for LDC data from
over 2800 organizations supports our mission to develop and share
resources for research in human language technologies.
About our winners:
Swarthmore College ~ The Department of Computer Science offers
courses that emphasize the fundamental concepts of computer science,
treating today's languages and systems as current examples of the
underlying concepts. By educating students to think conceptually, we
are preparing them to adapt to developments in this dynamic field.
Universidad Carlos III de Madrid ~ The Multimedia Processing Group
aims to make a significant research contribution to the field of
multimedia processing, especially focusing on combining signal
analysis tools with emerging machine learning methods. Projects
include automatic multimedia indexing, automatic speech recognition,
and latest-generation video coding.
*2010 Publications Pipeline*
For Membership Year 2010 (MY2010), we anticipate releasing a varied
selection of publications. Many publications are still in development,
but here is a glimpse of what is in the pipeline for MY2010. Please
note that this list is tentative and subject to modifications. Our
planned publications for the coming months include:
/Arabic Treebank: Part 3 v 3.2/ ~ a revision of Arabic Treebank:
Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) (LDC2005T20).
The full Arabic Treebank: Part 3 has been revised according to the
new Arabic Treebank annotation guidelines. The Arabic Treebank
project consists of two distinct phases: (a) Part-of-Speech (POS)
tagging which divides the text into lexical tokens, and gives
relevant information about each token such as lexical category,
inflectional features, and a gloss, and (b) Arabic Treebanking which
characterizes the constituent structures of word sequences, provides
categories for each non-terminal node, and identifies null elements,
co-reference, traces, etc. Arabic Treebank: Part
3 v 3.2 consists of 599 newswire stories from An Nahar.
/Chinese Treebank 7.0/ ~ this release encompasses 2400 text files,
containing 45000 sentences, 1.1 million words and 1.65 million hanzi
(Chinese characters). The data is provided in two encodings: GBK and
UTF-8, and the annotation has Penn Treebank-style labeled
brackets.
/Chinese Web 5-gram Version 1/ ~ contains n-grams (unigrams to
five-grams) and their observed counts in 880 billion tokens of
Chinese web data collected in March 2008. All text was converted to
UTF-8. A simple segmenter using the same algorithm used to generate
the data is included. The set contains 3.9 billion n-grams total.
/NPS Chat Corpus Version 1.0/ ~ consists of 10,567 posts gathered
from age-specific chat rooms. Each file is a recording transcript
from one of these chat rooms for a short period on a particular day.
In order to comply with the chat services' terms of service, the
posts have been privacy-masked. Each post is annotated with a chat
dialog-act tag, and individual tokens within each post are annotated
with part-of-speech tags.
/WTIMIT/ ~ is a mobile wideband (i.e., 50 Hz to 7 kHz) telephone
adjunct to TIMIT (LDC93S1). WTIMIT has been derived as follows:
the original TIMIT speech files at 16 kHz sampling rate were
concatenated into 11 signal chunks, each preceded by a 4-second
calibration tone. These speech chunks were transmitted via two
prepared Nokia 6220 mobile phones over T-Mobile's 3G wideband mobile
network in The Hague, The Netherlands, employing the Adaptive
Multirate Wideband (AMR-WB) speech codec. After data acquisition and
deconcatenation by maximizing the normalized cross-correlation with
the original speech files, a database was obtained that is time
aligned with the original TIMIT data with good precision.
Accordingly, all TIMIT label files can still be used. WTIMIT is
suitable for research on speech quality and intelligibility, and
investigations on possible wideband upgrades of network-sided IVR
systems with retrained or bandwidth extended acoustic models for
automatic speech recognition. WTIMIT will be presented at LREC2010.
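The alignment step mentioned in the WTIMIT description, locating each
original file inside a longer recording by maximizing normalized
cross-correlation, can be sketched as follows. This is only an
illustration of the general technique on toy data, not the procedure used
to build WTIMIT; the zero-mean variant of the correlation and the function
names are assumptions made for this example.

```python
import math


def normalized_xcorr(reference, window):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    mean_r = sum(reference) / len(reference)
    mean_w = sum(window) / len(window)
    num = sum((r - mean_r) * (w - mean_w) for r, w in zip(reference, window))
    den = math.sqrt(
        sum((r - mean_r) ** 2 for r in reference)
        * sum((w - mean_w) ** 2 for w in window)
    )
    return num / den if den else 0.0


def best_offset(reference, recording):
    """Return (offset, score) where `reference` best matches `recording`."""
    n = len(reference)
    scores = [
        normalized_xcorr(reference, recording[k:k + n])
        for k in range(len(recording) - n + 1)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data: the reference appears, attenuated, 3 samples into the recording.
reference = [0.0, 1.0, 0.0, -1.0, 0.5, -0.5]
recording = [0.1, -0.1, 0.05] + [0.8 * x for x in reference] + [0.0, 0.1]

offset, score = best_offset(reference, recording)
print(offset, round(score, 3))
```

Because the score is normalized, a uniformly attenuated copy of the
reference still scores close to 1, which is why this measure is a common
choice for signals that have passed through a transmission channel.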
2010 Subscription Members are automatically sent all MY2010 data as it
is released. 2010 Standard Members are entitled to request 16 corpora
for free from MY2010. Non-members may license most data for
research use only.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu