[Corpora-List] New from the LDC

Mon Jan 26 20:46:47 UTC 2009

LDC2009S01
-  CSLU: Numbers Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01>  -

LDC2009T01
-  English CTS Treebank with Structural Metadata
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01>  -

LDC2009T02
-  GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02>  -

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications.

------------------------------------------------------------------------

New Publications

(1) CSLU: Numbers Version 1.3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01> 
was created by the Center for Spoken Language Understanding (CSLU) at 
OGI School of Science and Engineering, Oregon Health and Science 
University, Beaverton, Oregon. It is a collection of naturally produced 
numbers taken from utterances in various CSLU telephone speech data 
collections. The corpus consists of approximately fifteen hours of 
speech and includes isolated digit strings, continuous digit strings, 
and ordinal/cardinal numbers.

The numbers come from several sources, including phone numbers, street 
addresses, and zip codes, and were uttered by 12618 speakers in a 
total of 23902 files. In most of CSLU's telephone data collections, 
callers were asked for their phone number, date of birth, or zip code. 
Callers would also occasionally leave numbers in the midst of another 
utterance. The numbers in those situations were extracted from the host 
utterance and added to the corpus.

Each file includes an orthographic transcription following the CSLU 
Labeling guidelines, which are included in the documentation for this 
publication. Many of the utterances have also been phonetically labeled.


(2) English CTS Treebank with Structural Metadata 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01> 
consists of metadata and syntactic structure annotations for 144 English 
telephone conversations, or 140,000 words, from data used in the EARS 
(Effective, Affordable, Reusable Speech-to-Text) program 
<http://projects.ldc.upenn.edu/EARS/>. English CTS Treebank with 
Structural Metadata was created to support EARS work in English. It 
applies EARS metadata extraction annotations and Penn Treebank methods 
to conversations from Switchboard-1 Release 2 (LDC97S62) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62> 
and from data collected for EARS under the Fisher Protocol.

The purpose of the EARS program was to develop robust speech 
recognition technology to address a range of languages and speaking 
styles. LDC provided conversational and broadcast speech and 
transcripts, annotations, lexicons and texts for language modeling in 
each of the EARS languages (Arabic, Chinese, English). LDC also 
supported a metadata extraction (MDE) research evaluation 
<http://projects.ldc.upenn.edu/MDE>, the goal of which was to enable 
technology to take raw speech-to-text (STT) output and refine it into 
forms of more use to humans and to downstream automatic processes. In 
simple terms, this means the creation of automatic transcripts that are 
maximally readable.

/Structural Metadata Annotation/:  The Fisher data was carefully 
transcribed by LDC staff using RT-04 Transcription Specification, 
Version 3.1 
<http://projects.ldc.upenn.edu/Transcription/rt-04/RT-04-guidelines-V3.1.pdf>; 
for the Switchboard data, transcripts developed at the Institute for 
Signal and Information Processing at Mississippi State University were 
used. The transcribed data was annotated according to SimpleMDE V6.2 
<http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf>, an 
annotation task defined by LDC that consisted of the following elements: 
Edit Disfluencies (repetitions, revisions, restarts and complex 
disfluencies), Fillers (including, e.g., filled pauses and discourse 
markers) and SUs, or syntactic/semantic units.

/Parsing and Treebank Annotation/:  The existing MDE annotations were 
converted from RTTM format into a format appropriate for the automatic 
parser, enabling the generation of accurate parses in a form that would 
require as little hand modification by the Treebank team as possible. 
RTTM is a format developed by NIST (National Institute of Standards and 
Technology) for the EARS program that labels each token in the 
reference transcript according to the properties it displays (e.g., 
lexeme versus non-lexeme, edit, filler, SU). The initial parse trees 
were produced using an entropy-based parser 
<http://www.ldc.upenn.edu/Catalog/docs/LDC2000T43/parser.pdf>.  These 
parses served as the starting point for a manual process which corrected 
the initial pass for each conversation.

To provide high-quality parses, scripts used the MDE annotations to 
separate the edited material from the fluent part of each SU before 
parsing. The edits were then parsed and reinserted into the tree 
for presentation to the annotators. Manual treebank annotation was 
performed in accordance with existing treebank guidelines for 
conversational telephone speech as well as in accordance with revised 
general guidelines for treebanking.
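
As an illustration only, the separation of edited from fluent material 
described above can be sketched at the token level. This is not LDC's 
script: the Token record, its field names, and the split_su and reinsert 
helpers are hypothetical, and the sketch restores token order only, 
whereas the actual process reinserted parsed edits into the tree.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token record; not the RTTM layout itself."""
    index: int     # position of the token in the original SU
    word: str
    is_edit: bool  # assumed flag: True if MDE marks the token as an edit


def split_su(tokens):
    """Separate the edited material from the fluent part of one SU."""
    fluent = [t for t in tokens if not t.is_edit]
    edits = [t for t in tokens if t.is_edit]
    return fluent, edits


def reinsert(fluent, edits):
    """Merge both token streams back into their original order."""
    return sorted(fluent + edits, key=lambda t: t.index)


# Invented example SU with two repetitions marked as edits.
su = [
    Token(0, "i", True),
    Token(1, "i", False),
    Token(2, "want", False),
    Token(3, "to", True),
    Token(4, "to", False),
    Token(5, "go", False),
]

fluent, edits = split_su(su)
fluent_words = [t.word for t in fluent]
edit_words = [t.word for t in edits]
restored_words = [t.word for t in reinsert(fluent, edits)]
```

In this sketch the fluent stream is what a parser would see first; 
keeping the original index on every token is what makes it possible to 
put the edits back in place afterwards.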


(3)  GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02> 
contains transcripts and English translations of 20.4 hours of Chinese 
broadcast conversation programming from China Central TV (CCTV) and 
Phoenix TV. It does not contain the audio files from which the 
transcripts and translations were generated. GALE Phase 1 Chinese 
Broadcast Conversation Parallel Text - Part 1, along with other corpora, 
was used as training data in year 1 (Phase 1) of the DARPA-funded GALE 
program. 

A total of 20.4 hours of Chinese broadcast conversation programming were 
selected from two sources: CCTV (a broadcaster from Mainland China), and 
Phoenix TV (a Hong Kong-based satellite TV station). The transcripts 
and translations represent recordings of eight different programs.  A 
manual selection procedure was used to choose data appropriate for the 
GALE program, namely, conversation (talk) programs focusing on current 
events. Stories on topics such as sports, entertainment and business 
were excluded from the data set.

The selected audio snippets were carefully transcribed by LDC annotators 
and professional transcription agencies following LDC's Quick Rich 
Transcription specification. Manual sentence unit/segment (SU) 
annotation was also performed as part of the transcription task. Three 
types of end-of-sentence SU were identified: statement SU, question SU, 
and incomplete SU.

After transcription and SU annotation, files were reformatted into a 
human-readable translation format and assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines, which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features (such as names and 
speech disfluencies) and quality control procedures applied to completed 
translations.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
