[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Jan 4 20:04:29 UTC 2008


-  Chinese Treebank 6.0 (CTB 6.0) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36>

-  2004 Spring NIST Rich Transcription (RT-04S) Development Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of two new publications.

------------------------------------------------------------------------

*New Publications*

(1) The Chinese Treebank project began at the University of Pennsylvania 
in 1998 and continues at Penn and the University of Colorado. Chinese 
Treebank 6.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36> 
is the latest version produced from this effort, consisting of 780,000 
words (over 1.28 million Chinese characters) that are segmented, 
part-of-speech tagged and fully bracketed. The data sources include 
newswire from Xinhua News Agency, articles from Sinorama Magazine, news 
from the website of the Hong Kong Special Administrative Region and 
transcripts from various broadcast news programs.

This release encompasses 2,036 text files containing 28,295 sentences, 
781,351 words and 1,285,149 hanzi (Chinese characters). The data is 
provided in two encodings, GBK and UTF-8, and in four formats: raw text, 
word segmented, word segmented and POS-tagged, and syntactically 
bracketed. The annotation uses Penn Treebank-style labeled brackets.
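
As a rough illustration of what "Penn Treebank-style labeled brackets" 
means in practice, the sketch below pulls (word, tag) pairs out of a 
bracketed string. The example sentence is made up for illustration and 
is not taken from the corpus, and the function name leaves_with_tags is 
hypothetical. The sketch assumes each word sits in a preterminal of the 
form "(TAG word)" and that empty categories carry the tag -NONE-.

```python
import re

# A preterminal is "(TAG word)": open paren, label, one token, close paren.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def leaves_with_tags(bracketed: str) -> list[tuple[str, str]]:
    """Return (word, tag) pairs from a Penn Treebank-style bracketed string.

    Empty categories (tag -NONE-) are skipped, since they are not words.
    """
    return [
        (word, tag)
        for tag, word in PRETERMINAL.findall(bracketed)
        if tag != "-NONE-"
    ]

# Made-up example in the general shape of a bracketed tree (not corpus data).
EXAMPLE = "(IP (NP-SBJ (NR 中国)) (VP (VV 发展) (NP-OBJ (NN 经济))))"

for word, tag in leaves_with_tags(EXAMPLE):
    print(f"{word}_{tag}")
```

Files in the GBK encoding would need to be opened with encoding="gbk" 
(and the UTF-8 files with encoding="utf-8") before being passed to a 
function like this one.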

(2)  The 2004 Spring NIST Rich Transcription (RT-04S) Development Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11> 
contains the test material (meeting speech and reference transcripts) 
used in the RT-04S evaluation administered by the NIST (National 
Institute of Standards and Technology) Speech Group 
<http://www.nist.gov/speech>. Rich Transcription (RT) is broadly defined 
as a fusion of speech-to-text technology and metadata extraction 
technologies designed to provide the basis for a generation of more 
usable transcriptions of human-human meeting speech.

The RT-04S development data consists of excerpts of approximately 10 
minutes each from eight meetings held at ICSI, CMU, LDC and NIST. 
Although the excerpts come from the same data collection sites that are 
represented in LDC2007S12, 2004 Spring NIST Rich Transcription (RT-04S) 
Evaluation Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12>, 
the development data is not completely representative of the evaluation 
test data, since it uses lapel mics in lieu of head mics for the LDC and 
CMU data and some different distant mics for the LDC data.

RT-04S included the following tasks in the meeting domain:

Speech-to-Text Transcription (STT) tasks

    Microphone conditions:
    ·         Multiple distant microphones
    ·         Single distant microphone
    ·         Individual head microphone

    Processing time conditions:
    ·         Unlimited time STT
    ·         Less than or equal to twenty times realtime
    ·         Less than or equal to ten times realtime
    ·         Less than or equal to one times realtime


Diarization (SPKR) task (who spoke when)

    Microphone conditions:
    ·         Multiple distant microphones
    ·         Single distant microphone

    Input conditions:
    ·         Speech input only
    ·         Speech plus reference transcript input

    Processing time conditions:
    ·         Unlimited time
    ·         Less than or equal to twenty times realtime
    ·         Less than or equal to ten times realtime
    ·         Less than or equal to one times realtime
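
The processing time conditions above can be read as limits on a 
realtime factor. The sketch below assumes that factor is total 
processing time divided by audio duration; this announcement does not 
say how the evaluation measured it, so treat that as an assumption. The 
function names and condition labels are hypothetical.

```python
def realtime_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

# Upper limits for the time-limited conditions listed above, loosest first.
LIMITS = {"<=20xRT": 20.0, "<=10xRT": 10.0, "<=1xRT": 1.0}

def satisfied_conditions(processing_seconds: float, audio_seconds: float) -> list[str]:
    """List every processing time condition a run would fall under."""
    rtf = realtime_factor(processing_seconds, audio_seconds)
    # The unlimited condition places no bound on processing time.
    names = ["unlimited"]
    names += [name for name, limit in LIMITS.items() if rtf <= limit]
    return names

# A 10-minute excerpt (600 s) processed in 75 minutes (4500 s).
print(realtime_factor(4500, 600))
print(satisfied_conditions(4500, 600))
```

Under this definition, a system that takes 75 minutes on a 10-minute 
excerpt runs at 7.5 times realtime, so it would fit the twenty-times and 
ten-times conditions but not the one-times condition.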

------------------------------------------------------------------------


Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                      ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
