[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Mon Jan 26 20:46:47 UTC 2009
LDC2009S01 - CSLU: Numbers Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01>
LDC2009T01 - English CTS Treebank with Structural Metadata
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01>
LDC2009T02 - GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02>
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.
------------------------------------------------------------------------
New Publications
(1) CSLU: Numbers Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01>
was created by the Center for Spoken Language Understanding (CSLU) at
OGI School of Science and Engineering, Oregon Health and Science
University, Beaverton, Oregon. It is a collection of naturally produced
numbers taken from utterances in various CSLU telephone speech data
collections. The corpus consists of approximately fifteen hours of
speech and includes isolated digit strings, continuous digit strings,
and ordinal/cardinal numbers.
The numbers come from several sources, among them phone numbers, street
addresses, and zip codes, and were uttered by 12,618 speakers in a
total of 23,902 files. In most of CSLU's telephone data collections,
callers were asked for their phone number, date of birth, or zip code.
Callers would also occasionally leave numbers in the midst of another
utterance. The numbers in those situations were extracted from the host
utterance and added to the corpus.
Each file includes an orthographic transcription following the CSLU
Labeling guidelines, which are included in the documentation for this
publication. Many of the utterances have also been phonetically labeled.
(2) English CTS Treebank with Structural Metadata
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01>
consists of metadata and syntactic structure annotations for 144 English
telephone conversations, or 140,000 words, from data used in the EARS
(Effective, Affordable, Reusable Speech-to-Text) program
<http://projects.ldc.upenn.edu/EARS/>. English CTS Treebank with
Structural Metadata was created to support EARS work in English. It
applies EARS metadata extraction annotations and Penn Treebank methods
to conversations from Switchboard-1 Release 2 (LDC97S62)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62>
and from data collected for EARS under the Fisher Protocol.
The purpose of the EARS program was to develop robust speech
recognition technology to address a range of languages and speaking
styles. LDC provided conversational and broadcast speech and
transcripts, annotations, lexicons and texts for language modeling in
each of the EARS languages (Arabic, Chinese, English). LDC also
supported a metadata extraction (MDE) research evaluation
<http://projects.ldc.upenn.edu/MDE>, the goal of which was to enable
technology to take raw speech-to-text (STT) output and refine it into
forms of more use to humans and to downstream automatic processes. In
simple terms, this means the creation of automatic transcripts that are
maximally readable.
/Structural Metadata Annotation/: The Fisher data was carefully
transcribed by LDC staff using the RT-04 Transcription Specification,
Version 3.1
<http://projects.ldc.upenn.edu/Transcription/rt-04/RT-04-guidelines-V3.1.pdf>;
for the Switchboard data, transcripts developed at the Institute for
Signal and Information Processing at Mississippi State University were
used. The transcribed data was annotated to SimpleMDE V6.2
<http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf>, an
annotation task defined by LDC that consisted of the following elements:
Edit Disfluencies (repetitions, revisions, restarts and complex
disfluencies), Fillers (including, e.g., filled pauses and discourse
markers) and SUs, or syntactic/semantic units.
/Parsing and Treebank Annotation/: The existing MDE annotations were
converted from RTTM format into a format appropriate for the automatic
parser, enabling the generation of accurate parses in a form that would
require as little hand modification by the Treebank team as possible.
RTTM is a format developed by NIST (National Institute of Standards and
Technology) for the EARS program that labels each token in the
reference transcript according to the properties it displays (e.g.,
lexeme versus non-lexeme, edit, filler, SU). The initial parse trees
were produced using an entropy-based parser
<http://www.ldc.upenn.edu/Catalog/docs/LDC2000T43/parser.pdf>. These
parses served as the starting point for a manual process which corrected
the initial pass for each conversation.
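
As a rough illustration of the token-level labeling described above, the sketch below reads RTTM-style records in Python. It assumes the nine whitespace-separated fields commonly documented for RTTM (type, file, channel, start time, duration, orthography, subtype, speaker name, confidence); that layout, the file ID, the speaker name and the sample records are all assumptions made for illustration and are not taken from this corpus or its documentation.

```python
from collections import namedtuple

# Assumed field order for an RTTM-style record (an assumption, not copied
# from this corpus's documentation).
RttmRecord = namedtuple(
    "RttmRecord", "type file chnl tbeg tdur ortho stype name conf"
)

def parse_rttm_line(line):
    """Parse one RTTM-style line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith(";"):
        return None
    fields = line.split()
    if len(fields) < 9:
        raise ValueError("expected at least 9 fields, got %d" % len(fields))
    return RttmRecord(*fields[:9])

def parse_rttm(text):
    """Parse a block of RTTM-style text into a list of records."""
    parsed = (parse_rttm_line(line) for line in text.splitlines())
    return [rec for rec in parsed if rec is not None]

# Hypothetical sample records, invented for this example.
SAMPLE = """\
;; illustrative records only
SU sw_0001 1 0.00 1.20 <NA> statement spkA <NA>
FILLER sw_0001 1 0.00 0.30 <NA> filled_pause spkA <NA>
LEXEME sw_0001 1 0.00 0.30 uh fp spkA <NA>
EDIT sw_0001 1 0.30 0.20 <NA> repetition spkA <NA>
LEXEME sw_0001 1 0.30 0.20 i lex spkA <NA>
LEXEME sw_0001 1 0.50 0.20 i lex spkA <NA>
LEXEME sw_0001 1 0.70 0.50 agree lex spkA <NA>
"""

records = parse_rttm(SAMPLE)
# Collect the orthography of every lexeme token, in file order.
lexemes = [rec.ortho for rec in records if rec.type == "LEXEME"]
```

Keeping every field as a string avoids guessing at numeric conventions; a real converter would follow the RTTM specification shipped with the data.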
To provide high-quality parses, scripts used the MDE annotations to
separate the edited material from the fluent part of each SU before the
fluent part was parsed. The edits were then parsed and reinserted into the tree
for presentation to the annotators. Manual treebank annotation was
performed in accordance with existing treebank guidelines for
conversational telephone speech as well as in accordance with revised
general guidelines for treebanking.
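
The separate-then-reinsert idea in the paragraph above can be sketched in a few lines of Python. This is a hypothetical illustration only: the actual LDC scripts are not described in this announcement, the function names and data layout are invented, and the sketch works on a flat token sequence rather than on parse trees.

```python
def split_su(tokens, edit_spans):
    """Separate edited tokens from the fluent part of one SU.

    tokens: list of (word, start_ms, end_ms) tuples in time order.
    edit_spans: list of (start_ms, end_ms) regions marked as edits.
    Returns (fluent_words, edits), where each edit is a pair
    (position_in_fluent, [edited words]) recording where it was removed.
    """
    fluent, edits = [], []
    current = None  # words of the edit region currently being collected
    for word, start, end in tokens:
        in_edit = any(s <= start and end <= e for s, e in edit_spans)
        if in_edit:
            if current is None:
                current = []
                edits.append((len(fluent), current))
            current.append(word)
        else:
            current = None
            fluent.append(word)
    return fluent, edits

def reinsert(fluent, edits):
    """Put the edited words back at the positions they were taken from."""
    out = list(fluent)
    # Work from the rightmost position so earlier indices stay valid.
    for pos, words in sorted(edits, key=lambda e: e[0], reverse=True):
        out[pos:pos] = words
    return out

# Hypothetical SU: "i i agree with with that", with two repetition edits.
tokens = [
    ("i", 0, 200), ("i", 200, 400), ("agree", 400, 900),
    ("with", 900, 1100), ("with", 1100, 1300), ("that", 1300, 1600),
]
edit_spans = [(0, 200), (900, 1100)]

fluent, edits = split_su(tokens, edit_spans)
restored = reinsert(fluent, edits)
```

In this toy example the fluent part ("i agree with that") is what would go to the parser first, and the recorded positions make it possible to restore the original word order afterwards.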
(3) GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02>
contains transcripts and English translations of 20.4 hours of Chinese
broadcast conversation programming from China Central TV (CCTV) and
Phoenix TV. It does not contain the audio files from which the
transcripts and translations were generated. GALE Phase 1 Chinese
Broadcast Conversation Parallel Text - Part 1, along with other corpora,
was used as training data in year 1 (Phase 1) of the DARPA-funded GALE
program.
A total of 20.4 hours of Chinese broadcast conversation programming were
selected from two sources: CCTV (a broadcaster from Mainland China), and
Phoenix TV (a Hong Kong-based satellite TV station). The transcripts
and translations represent recordings of eight different programs. A
manual selection procedure was used to choose data appropriate for the
GALE program, namely, conversation (talk) programs focusing on current
events. Stories on topics such as sports, entertainment and business
were excluded from the data set.
The selected audio snippets were carefully transcribed by LDC annotators
and professional transcription agencies following LDC's Quick Rich
Transcription specification. Manual sentence unit/segment (SU)
annotation was also performed as part of the transcription task. Three
types of end-of-sentence SU were identified: statement SU, question SU,
and incomplete SU.
After transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names and
speech disfluencies) and quality control procedures applied to completed
translations.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu