[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Mar 1 16:21:02 UTC 2010


/New Publications:/

LDC2010S01 - *Fisher Spanish Speech*

LDC2010T04 - *Fisher Spanish - Transcripts*

/Other news:/

*- 65,000th LDC Corpus Distributed! -*

*- 2010 Publications Pipeline -*


------------------------------------------------------------------------


*New Publications*


(1)  Fisher Spanish Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01> 
was developed by LDC and consists of audio files covering roughly 163 
hours of telephone speech from 136 native Caribbean Spanish and 
non-Caribbean Spanish speakers. Full orthographic transcripts of these 
audio files are available in Fisher Spanish - Transcripts (LDC2010T04) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04>.

The Fisher telephone conversation collection protocol was created at LDC 
to address a critical need of developers trying to build robust 
automatic speech recognition (ASR) systems. Under the Fisher protocol, a 
very large number of participants each make a few calls of short 
duration speaking to other participants, whom they typically do not 
know, about assigned topics. This maximizes inter-speaker variation and 
vocabulary breadth although it also increases formality.  Previous 
protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon 
participant activity to drive the collection. Fisher is unique in being 
platform driven rather than participant driven. Participants who wish to 
initiate a call may do so; however, the collection platform initiates the 
majority of calls. Participants need only answer their phones at the 
times they specified when registering for the study.

To encourage a broad range of vocabulary, Fisher participants are asked 
to speak on an assigned topic selected at random from a list. The topic 
changes every 24 hours and is assigned to all subjects paired on that 
day. Some topics are inherited or refined from previous Switchboard 
studies, while others were developed specifically for the Fisher 
protocol.

In collecting data for this corpus, attempts were made to provide a 
representative distribution of subjects across a variety of demographic 
categories, including gender, age, dialect region, and education level.  
Native speakers of Caribbean Spanish and non-Caribbean Spanish were 
recruited from within the continental United States and Puerto Rico.

The speech recordings consist of 819 telephone conversations of 10 to 12 
minutes in duration. They are provided as digital audio files in NIST 
SPHERE format (1024-byte ASCII file headers). The conversations were 
recorded as 2-channel mu-law sample data with 8000 samples per second 
(as captured from the public telephone network).
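
The header layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the usual SPHERE convention of an `NIST_1A` magic line, a header-size line, `name -type value` fields, and an `end_head` terminator. The function names and the sample header values are illustrative and are not taken from the corpus; the duration estimate further assumes the samples are stored uncompressed.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3 or line.startswith(";"):
            continue  # skip blank, comment, or malformed lines
        name, type_code, value = parts
        if type_code == "-i":        # integer field
            fields[name] = int(value)
        elif type_code == "-r":      # real-valued field
            fields[name] = float(value)
        else:                        # -sN: string of length N
            fields[name] = value
    return fields


def duration_seconds(file_size: int, fields: dict) -> float:
    """Estimate audio duration, assuming uncompressed interleaved samples."""
    bytes_per_second = (
        fields["sample_rate"] * fields["channel_count"] * fields["sample_n_bytes"]
    )
    return (file_size - fields["header_size"]) / bytes_per_second


# Hypothetical header padded to 1024 bytes (values chosen for illustration).
sample_header = (
    "NIST_1A\n"
    "   1024\n"
    "channel_count -i 2\n"
    "sample_rate -i 8000\n"
    "sample_n_bytes -i 1\n"
    "sample_coding -s4 ulaw\n"
    "end_head\n"
).encode("ascii").ljust(1024, b" ")

if __name__ == "__main__":
    info = parse_sphere_header(sample_header)
    print(info)
    # Size of a hypothetical 10-minute, 2-channel, 8000 Hz mu-law file.
    print(duration_seconds(1024 + 8000 * 2 * 600, info))
```

With one byte per mu-law sample, two channels at 8000 samples per second occupy 16,000 bytes per second of conversation, so a 10 to 12 minute call comes to roughly 9.6 to 11.5 MB when stored uncompressed.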



(2) Fisher Spanish - Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04> 
was developed by LDC and contains full orthographic transcripts of the 
telephone speech in Fisher Spanish Speech (LDC2010S01) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01>. 
Transcripts cover roughly 163 hours of telephone speech from 136 native 
Caribbean Spanish and non-Caribbean Spanish speakers.

The transcript files are in plain-text, tab-delimited format (tdf) with 
UTF-8 character encoding. They were created with the LDC-developed 
transcription tool "XTrans" <http://www.ldc.upenn.edu/tools/XTrans/>, 
which allowed for improved handling of multi-channel audio and 
overlapping speakers. XTrans is available from LDC.
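
A tab-delimited UTF-8 transcript of this kind can be loaded with Python's standard `csv` module. The sketch below is only an illustration: the column names, the `;;` comment prefix, and the sample rows are assumptions made for the example, so check the release documentation for the actual tdf column layout before relying on them.

```python
import csv
import io

# Hypothetical column layout; the real tdf columns are defined in the
# documentation shipped with the corpus.
EXAMPLE_COLUMNS = ["file", "channel", "start", "end", "speaker", "transcript"]


def read_tdf(stream, columns=EXAMPLE_COLUMNS):
    """Read tab-delimited rows into dictionaries keyed by column name."""
    rows = []
    # QUOTE_NONE keeps quotation marks in the transcript text untouched.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith(";;"):
            continue  # skip blank lines and (assumed) comment lines
        rows.append(dict(zip(columns, fields)))
    return rows


# In-memory stand-in for a transcript file (made-up content).
sample_text = (
    ";; example comment line\n"
    "call_0001.sph\t0\t1.25\t3.80\tA\thola, ¿cómo estás?\n"
    "call_0001.sph\t1\t3.10\t4.95\tB\tmuy bien, gracias\n"
)

if __name__ == "__main__":
    for row in read_tdf(io.StringIO(sample_text)):
        print(row["channel"], row["start"], row["end"], row["transcript"])
```

For a file on disk, opening it with `open(path, encoding="utf-8", newline="")` and passing the handle to `read_tdf` should behave the same way. Because each row carries its own channel and time span, overlapping speech on the two channels is represented simply as rows whose intervals overlap.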

Transcribers followed LDC's Transcription Guidelines (NQTR), which are 
included with the documentation for this release.

Fisher Spanish Speech (LDC2010S01) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01> 
provides the digital audio used as the basis for the transcriptions in 
this corpus, in the form of 2-channel mu-law sample data with 8000 
samples per second (as captured from the public telephone network), for 
819 telephone conversations of 10 to 12 minutes in duration. The audio 
files are in NIST SPHERE format (1024-byte ASCII file headers).


*65,000th LDC Corpus Distributed!*

LDC has recently reached another milestone.  Two years after having 
distributed our 50,000th corpus, we have just distributed our 65,000th!  
To help us celebrate, we took the names of all the organizations that 
had licensed data on the day we distributed our 65,000th corpus and 
tossed them into a Phillies baseball cap. 

We then randomly drew a name, and the winner is ... Swarthmore College 
and Universidad Carlos III de Madrid!  That's not a typo; we have two 
lucky winners!  We are celebrating our 65,000th distribution by awarding 
a benefit of US$2000 each to both Swarthmore College and Universidad 
Carlos III de Madrid. The benefit can be used towards membership or data 
licensing fees at any time this year.

Swarthmore College and Universidad Carlos III de Madrid join our other 
recipients of landmark corpora distributions:

    *     Helsinki University of Technology, Adaptive Informatics
      Research Centre (AIRC) - licensed our 50,000th distribution in
      January 2008.
    *     Instituto de Engenharia de Sistemas e Computadores (INESC) -
      licensed our 40,000th distribution in November 2006.
    *     University of Hawai'i, Manoa, Language Analysis and
      Experimentation Laboratories - licensed our 15,000th distribution
      in April 2002.

We would like to thank both members and non-members for helping the LDC 
reach this landmark distribution. The unceasing demand for LDC data from 
over 2800 organizations supports our mission to develop and share 
resources for research in human language technologies. 


About our winners:

    Swarthmore College ~ The Department of Computer Science offers
    courses that emphasize the fundamental concepts of computer science,
    treating today's languages and systems as current examples of the
    underlying concepts. By educating students to think conceptually, we
    are preparing them to adapt to developments in this dynamic field.

    Universidad Carlos III de Madrid ~ The Multimedia Processing Group
    aims to make a significant research contribution to the field of
    multimedia processing, especially focusing on combining signal
    analysis tools with emerging machine learning methods. Projects
    include automatic multimedia indexing, automatic speech recognition,
    and last-generation video coding.




*2010 Publications Pipeline*

For Membership Year 2010 (MY2010), we anticipate releasing a varied 
selection of publications. Many publications are still in development, 
but here is a glimpse of what is in the pipeline for MY2010.  Please 
note that this list is tentative and subject to modifications.  Our 
planned publications for the coming months include:

    /Arabic Treebank: Part 3 v 3.2/ ~ a revision of Arabic Treebank:
    Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) (LDC2005T20).
    The full Arabic Treebank:  Part 3 has been revised according to the
    new Arabic Treebank annotation guidelines.  The Arabic Treebank
    project consists of two distinct phases: (a) Part-of-Speech (POS)
    tagging, which divides the text into lexical tokens and gives
    relevant information about each token such as lexical category,
    inflectional features, and a gloss, and (b) Arabic Treebanking, which
    characterizes the constituent structures of word sequences, provides
    categories for each non-terminal node, and identifies null elements,
    co-reference, traces, etc. Arabic Treebank:  Part 3 v 3.2 consists
    of 599 newswire stories from An Nahar.

    /Chinese Treebank 7.0/ ~ this release encompasses 2400 text files,
    containing 45000 sentences, 1.1 million words and 1.65 million hanzi
    (Chinese characters). The data is provided in two encodings: GBK and
    UTF-8, and the annotation has Penn Treebank-style labeled
    brackets.       

    /Chinese Web 5-gram Version 1/ ~ contains n-grams (unigrams to
    five-grams) and their observed counts in 880 billion tokens of
    Chinese web data collected in March 2008. All text was converted to
    UTF-8. A simple segmenter using the same algorithm used to generate
    the data is included. The set contains 3.9 billion n-grams total.

    /NPS Chat Corpus Version 1.0/ ~ consists of 10,567 posts gathered
    from age-specific chat rooms. Each file is a recording transcript
    from one of these chat rooms for a short period on a particular day.
    In order to comply with the chat services' terms of service, the
    posts have been privacy-masked. Each post is annotated with a chat
    dialog-act tag, and individual tokens within each post are annotated
    with part-of-speech tags.

    /WTIMIT/ ~ is a mobile wideband (i.e., 50 Hz to 7 kHz) telephone
    adjunct to TIMIT (LDC93S1).  WTIMIT has been derived as follows:
    the original TIMIT speech files at 16 kHz sampling rate were
    concatenated into 11 signal chunks, each preceded by a 4-second
    calibration tone. These speech chunks were transmitted via two
    prepared Nokia 6220 mobile phones over T-Mobile's 3G wideband mobile
    network in The Hague, The Netherlands, employing the Adaptive
    Multirate Wideband (AMR-WB) speech codec. After data acquisition and
    deconcatenation by maximizing the normalized cross-correlation with
    the original speech files, a database was obtained that is time
    aligned with the original TIMIT data with good precision.
    Accordingly, all TIMIT label files can still be used.  WTIMIT is
    suitable for research on speech quality and intelligibility, and
    for investigations of possible wideband upgrades of network-side IVR
    systems with retrained or bandwidth-extended acoustic models for
    automatic speech recognition.  WTIMIT will be presented at LREC2010.
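
The alignment step mentioned in the WTIMIT description, locating an original utterance inside a received signal by maximizing the normalized cross-correlation, can be illustrated with a small pure-Python sketch. This is not the WTIMIT authors' implementation; the function name and the toy signals are hypothetical, and a brute-force search like this would be far too slow for real 16 kHz audio, where an FFT-based correlation would normally be used.

```python
import math


def best_alignment(reference, received):
    """Return (lag, score) where the normalized cross-correlation peaks.

    Each window of `received` is mean-centred and scaled to unit norm
    before comparison, so gain and offset changes do not affect the score.
    """
    n = len(reference)
    ref_mean = sum(reference) / n
    ref_centred = [x - ref_mean for x in reference]
    ref_norm = math.sqrt(sum(x * x for x in ref_centred))

    best_lag, best_score = 0, -1.0
    for lag in range(len(received) - n + 1):
        window = received[lag:lag + n]
        win_mean = sum(window) / n
        win_centred = [x - win_mean for x in window]
        win_norm = math.sqrt(sum(x * x for x in win_centred))
        if ref_norm == 0 or win_norm == 0:
            continue  # correlation is undefined for a constant signal
        score = sum(a * b for a, b in zip(ref_centred, win_centred))
        score /= ref_norm * win_norm
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Toy data: the reference reappears at offset 3 with a different gain
# and DC offset, surrounded by a constant background.
reference = [0.0, 1.0, 0.0, -1.0, 0.5]
received = [0.1, 0.1, 0.1] + [2 * x + 0.3 for x in reference] + [0.1, 0.1]

if __name__ == "__main__":
    print(best_alignment(reference, received))
```

Normalizing each window is what lets the search tolerate the level changes a codec and telephone channel introduce; only the shape of the waveform has to match.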

2010 Subscription Members are automatically sent all MY2010 data as it 
is released.  2010 Standard Members are entitled to request 16 corpora 
for free from MY2010.  Non-members may license most data for research 
use only.



------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
