[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Jan 4 20:04:29 UTC 2008
- Chinese Treebank 6.0 (CTB 6.0)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36>
- 2004 Spring NIST Rich Transcription (RT-04S) Development Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11>

The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications.
------------------------------------------------------------------------
New Publications
(1) The Chinese Treebank project began at the University of Pennsylvania
in 1998 and continues at Penn and the University of Colorado. Chinese
Treebank 6.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36>
is the latest version produced by this effort, consisting of approximately 780,000
words (over 1.28 million Chinese characters) that are segmented,
part-of-speech tagged and fully bracketed. The data sources include
newswire from Xinhua News Agency, articles from Sinorama Magazine, news
from the website of the Hong Kong Special Administrative Region and
transcripts from various broadcast news programs.
This release encompasses 2,036 text files, containing 28,295 sentences,
781,351 words and 1,285,149 hanzi (Chinese characters). The data is
provided in two encodings, GBK and UTF-8, and in four formats: raw
text, word-segmented, word-segmented and POS-tagged, and syntactically
bracketed. The annotation uses Penn Treebank-style labeled brackets.
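To illustrate how the bracketed format relates to the segmented and POS-tagged formats, here is a minimal Python sketch that reads one Penn Treebank-style labeled bracketing and recovers the tagged words from it. The sample tree is invented for illustration and is not taken from the corpus; the function names are hypothetical, and the actual CTB 6.0 files may carry extra markup or annotation conventions that this sketch does not handle.

```python
# Illustrative tree only: NOT taken from CTB 6.0.
SAMPLE = "(IP (NP (NR 上海)) (VP (VV 发展)))"


def parse_brackets(text):
    """Parse one labeled-bracket tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1                              # consume the closing ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_words(tree):
    """Return (word, POS) pairs in sentence order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]         # preterminal: POS tag over a word
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


pairs = tagged_words(parse_brackets(SAMPLE))
segmented = " ".join(word for word, _ in pairs)

# The same text round-trips through either of the two encodings named above.
assert segmented.encode("gbk").decode("gbk") == segmented
assert segmented.encode("utf-8").decode("utf-8") == segmented
```

Dropping the tags from `pairs` gives the word-segmented view, and joining the words without spaces gives the raw text, which is one way to see the four formats as successive layers over the same sentences.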
(2) The 2004 Spring NIST Rich Transcription (RT-04S) Development Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11>
contains the test material (meeting speech and reference transcripts)
used in the RT-04S evaluation administered by the NIST (National
Institute of Standards and Technology) Speech Group
<http://www.nist.gov/speech>. Rich Transcription (RT) is broadly defined
as a fusion of speech-to-text technology and metadata extraction
technologies designed to provide the basis for a generation of more
usable transcriptions of human-human meeting speech.
The RT-04S development data consists of approximately 10 minutes of
recordings of eight meetings held at ICSI, CMU, LDC and NIST. Although
the development data consists of 10-minute excerpts from the same
data collection sites that are represented in LDC2007S12, 2004 Spring
NIST Rich Transcription (RT-04S) Evaluation Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12>,
it is not completely reflective of the evaluation test data, since it
uses lapel mics in lieu of head mics for the LDC and CMU data and
some different distant mics for the LDC data.
RT-04S included the following tasks in the meeting domain:

Speech-to-Text Transcription (STT) tasks
  Microphone conditions:
    · Multiple distant microphones
    · Single distant microphone
    · Individual head microphone
  Processing time conditions:
    · Unlimited time STT
    · Less than or equal to twenty times realtime
    · Less than or equal to ten times realtime
    · Less than or equal to one times realtime

Diarization (SPKR) task (who spoke when)
  Microphone conditions:
    · Multiple distant microphones
    · Single distant microphone
  Input conditions:
    · Speech input only
    · Speech plus reference transcript input
  Processing time conditions:
    · Unlimited time
    · Less than or equal to twenty times realtime
    · Less than or equal to ten times realtime
    · Less than or equal to one times realtime
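The processing time conditions above can be read as bounds on a realtime factor, that is, processing time divided by the duration of the audio processed. The Python sketch below shows that arithmetic under this assumption. The function names and condition labels are hypothetical, and the official RT-04S evaluation plan, not this sketch, defines how processing time is actually measured.

```python
def realtime_factor(processing_seconds, audio_seconds):
    """Processing time expressed as a multiple of the audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def time_condition(processing_seconds, audio_seconds):
    """Return the tightest processing time condition a run satisfies.

    The labels are illustrative shorthand for the conditions listed above.
    """
    factor = realtime_factor(processing_seconds, audio_seconds)
    for limit, name in ((1, "<=1xRT"), (10, "<=10xRT"), (20, "<=20xRT")):
        if factor <= limit:
            return name
    return "unlimited"


# A 10-minute excerpt (600 s) processed in 75 minutes (4500 s)
# has a realtime factor of 7.5, so it falls under the ten-times bound.
example_factor = realtime_factor(4500, 600)
example_condition = time_condition(4500, 600)
```

A run slower than twenty times realtime would only qualify for the unlimited-time condition.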
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu