[Corpora-List] New Data from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Mon Sep 22 21:19:16 UTC 2008
*- Release of Additional NomBank Files -*
*- Switchboard Dialog Act Corpus Now Available -*
*- CALLHOME Mandarin Chinese Transcripts - XML Version (LDC2008T17)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T17> -*
*- CSLU: ISOLET Spoken Letter Database Version 1.3 (LDC2008S07)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S07> -*
*- GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 (LDC2008T18)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T18> -*
------------------------------------------------------------------------
*Release of Additional NomBank Files*
NomBank is an annotation project at New York University which provides
argument structure for instances of common nouns in Treebank-2 (LDC95T7)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7> and
Treebank-3 (LDC99T42)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>,
also known as the 'Penn Treebanks'. Last December, the project released
NomBank.1.0, which covers all the "markable" nouns in the Wall Street
Journal material in the Penn Treebanks. That release contains 114,576
propositions, derived by examining 202,965 noun instances and keeping
only those nouns whose arguments occur in the text. NomBank and related
resources are available from the NomBank project website
<http://nlp.cs.nyu.edu/meyers/NomBank.html>.
The LDC is now making available additional NomBank data that had
previously been restricted due to licensing arrangements with their
owners. Those files are as follows:
* NomBank v 1.0 (LDC2008T23)
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23>
    o a complete printout of NomBank in human-readable form.
      A license to either Treebank-2 (LDC95T7)
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7> or
      Treebank-3 (LDC99T42)
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>
      is required to obtain NomBank v 1.0.
* COMNOM v 1.0 (LDC2008T24)
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T24>
    o COMNOM is created by automatically adding classes to COMLEX
      Syntax on the basis of NOMLEX-PLUS. For details, please see
      the document entitled "Those Other NomBank Dictionaries"
      <http://nlp.cs.nyu.edu/meyers/nombank/those-other-nombank-dictionaries.pdf>.
      A license to COMLEX English Syntax Lexicon (LDC98L21)
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98L21>
      or COMLEX Syntax Text Corpus Version 2.0 (LDC96T11)
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T11>
      is required to obtain COMNOM v 1.0.
All requests for these files can be directed to ldc at ldc.upenn.edu
<mailto:ldc at ldc.upenn.edu>.
*Switchboard Dialog Act Corpus Now Available*
The Switchboard Dialog Act Corpus is a version of the Switchboard-1
Release 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62>
corpus of telephone conversations tagged with a shallow discourse
tagset of approximately 60 basic dialog act tags and combinations. The
discourse tag-set used is an augmentation of the Dialog Act Markup in
Several Layers (DAMSL) tag-set and is referred to as the 'SWBD-DAMSL'
labels. These annotations were created in 1997 at the University of
Colorado at Boulder, with the goal of building better language models
for automatic speech recognition of the Switchboard domain. To that end,
the label-set incorporates both traditional sociolinguistic and
discourse-theoretic rhetorical relations/adjacency pairs and some more
form-based labels. The
Switchboard Dialog Act Corpus contains labels for 1155 5-minute
conversations, comprising 205,000 utterances and 1.4 million words.
To download this corpus from our ftp server, please visit the LDC
catalog page for Switchboard-1 Release 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62>
and scroll down to the section entitled 'Updates'.
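For readers who plan to work with the dialog act labels programmatically,
the minimal Python sketch below tallies tag frequencies across utterances.
It assumes a hypothetical tab-separated export with one 'tag<TAB>utterance'
pair per line; the .utt files in the actual distribution use their own
layout, so adapt the parsing accordingly.

    # Minimal sketch: count SWBD-DAMSL dialog act tag frequencies.
    # Assumes a hypothetical tab-separated export ("tag<TAB>utterance" per line);
    # the .utt files in the actual release use a different layout.
    from collections import Counter
    from pathlib import Path

    def tag_counts(path):
        counts = Counter()
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            tag, _, _utterance = line.partition("\t")
            counts[tag.strip()] += 1
        return counts

    # "swbd_damsl_utterances.tsv" is a placeholder file name.
    for tag, n in tag_counts("swbd_damsl_utterances.tsv").most_common(10):
        print(tag, n)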
*New Publications*
(1) LDC's CALLHOME Mandarin Chinese collection includes telephone
speech, associated transcripts and a lexicon. CALLHOME Mandarin Chinese
Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S34>
consists of 120 unscripted telephone conversations between native
speakers of Mandarin Chinese. All calls, which lasted up to thirty
minutes, originated in North America and were placed to locations
overseas; most participants called family members or close friends.
CALLHOME Mandarin Chinese Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T16>
covers a contiguous five- or ten-minute segment from each of the
telephone speech files. The transcripts are in tab-delimited format with
GB2312 encoding, are timestamped by speaker turn for alignment with the
speech signal and are provided in standard orthography. CALLHOME
Mandarin Chinese Lexicon (LDC96L15)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L15>
comprises over 40,000 words drawn from twenty CALLHOME Mandarin
transcripts.
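As a rough illustration of handling the original tab-delimited,
GB2312-encoded transcripts, the Python sketch below decodes a file and
splits each line into fields. The column order used here (start time, end
time, speaker, text) is an assumption for illustration only; consult the
release documentation for the actual layout.

    # Minimal sketch: read a tab-delimited, GB2312-encoded CALLHOME transcript.
    # The column order (start, end, speaker, text) is an assumption for
    # illustration only; check the transcript documentation for the real layout.
    def read_transcript(path):
        turns = []
        with open(path, encoding="gb2312", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 4:
                    continue  # skip blank, comment or malformed lines
                start, end, speaker = fields[0], fields[1], fields[2]
                text = "\t".join(fields[3:])
                turns.append({"start": start, "end": end,
                              "speaker": speaker, "text": text})
        return turns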
CALLHOME Mandarin Chinese Transcripts - XML Version
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T17>,
the latest addition to this collection, was created by Lancaster
University and presents the entire original corpus of 120 transcripts in
XML format with UTF-8 encoding, retokenization and part-of-speech (POS)
tagging. The retokenization and POS information were supplied using the
Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of
Computing Technology, Chinese Academy of Sciences
<http://www.ict.ac.cn/english/>, Beijing. ICTCLAS aims to incorporate
Chinese word segmentation, POS tagging, disambiguation and unknown word
recognition into a single theoretical framework using multi-layered
hierarchical hidden Markov models.
In addition to the original applications for Mandarin Chinese CALLHOME
data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts -
XML Version will be useful in the grammatical study of spoken Mandarin.
This XML corpus retains all of the linguistic analyses (e.g.,
timestamps, spoken features and proper nouns) from the original
transcripts release, but the mnemonics used in the original release were
migrated into XML markup.
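For those who want a quick look at the XML version, the sketch below uses
Python's standard library to pull (token, POS) pairs out of a UTF-8 XML
transcript. The element and attribute names ('turn', 'w', 'pos') are
placeholders rather than the actual schema, which should be taken from the
corpus documentation.

    # Minimal sketch: extract (token, POS) pairs from a UTF-8 XML transcript.
    # The element/attribute names ("turn", "w", "pos") are hypothetical
    # placeholders; substitute the names defined in the corpus documentation.
    import xml.etree.ElementTree as ET

    def tokens_with_pos(xml_path):
        tree = ET.parse(xml_path)  # honors the file's declared UTF-8 encoding
        pairs = []
        for turn in tree.getroot().iter("turn"):
            for w in turn.iter("w"):
                pairs.append((w.text or "", w.get("pos", "")))
        return pairs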
All analyses in the original release were retained, at some cost to
tokenization and part-of-speech tagging accuracy (e.g., a mnemonic
encoding a spoken feature may split a word, which can affect tagging
accuracy). However, the results of the automated processing were
substantially post-edited, and a large number of obvious typographical
errors in the original release were corrected during post-editing.
(2) CSLU: ISOLET Spoken Letter Database Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S07>
was created by the Center for Spoken Language Understanding (CSLU) at
OGI School of Science and Engineering, Oregon Health and Science
University, Beaverton, Oregon. CSLU: ISOLET Spoken Letter Database
Version 1.3 is a database of letters of the English alphabet spoken in
isolation under quiet laboratory conditions, together with associated
transcripts. The data were collected in 1990 and consist of two
productions of each letter by 150 speakers (7,800 spoken letters),
approximately 1.25 hours of speech in total. The subjects were 75 male
speakers and 75 female speakers; all reported English as their native
language.
Speech was recorded in the OGI speech recognition laboratory and the
recording equipment was selected to mimic the equipment used to collect
the TIMIT
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1>
database as closely as possible. The speech was recorded with a
Sennheiser HMD 224 noise-canceling microphone, low-pass filtered at 7.6
kHz. Data capture was performed using an AT&T DSP32 board installed in
a Sun 4/110. The data were sampled at 16 kHz and converted to RIFF
(.wav) format.
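Because the audio is distributed as 16 kHz RIFF (.wav) files, individual
recordings can be inspected with Python's standard wave module, as in the
sketch below; the file name is a placeholder.

    # Minimal sketch: inspect one of the 16 kHz RIFF (.wav) recordings.
    # "isolet_example.wav" is a placeholder file name.
    import wave

    with wave.open("isolet_example.wav", "rb") as w:
        rate = w.getframerate()        # expected to be 16000 Hz
        frames = w.getnframes()
        print(w.getnchannels(), "channel(s),", rate, "Hz,",
              round(frames / rate, 2), "seconds")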
The transcriptions of the recorded speech are time-aligned phonetic
transcriptions conforming to the CSLU Labeling standards. Time-aligned
word transcriptions are represented in a standard orthography or
romanization. Speech and non-speech phenomena are distinguished. The
transcriptions are aligned to a waveform by placing boundaries to mark
the beginning and ending of words. In addition to the specification of
boundaries, this level of transcription includes additional commentary
on salient speech and non-speech characteristics, such as
glottalization, inhalation, and exhalation.
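For working with the time-aligned boundaries, the sketch below assumes a
simple 'start end label' line format; the actual CSLU label file syntax
may differ, so check it against the CSLU Labeling documentation before
relying on this.

    # Minimal sketch: load time-aligned labels, assuming "start end label" lines.
    # The real CSLU label file format may differ; this is illustrative only.
    def load_labels(path):
        segments = []
        with open(path, errors="replace") as f:
            for line in f:
                parts = line.split()
                if len(parts) < 3:
                    continue
                start, end = float(parts[0]), float(parts[1])
                label = " ".join(parts[2:])
                segments.append((start, end, label))
        return segments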
(3) GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T18>
contains transcripts and English translations of 19.1 hours of Chinese
broadcast news programming from Voice of America (VOA), China Central TV
(CCTV) and Phoenix TV. It does not contain the audio files from which
the transcripts and translations were generated. GALE Phase 1 Chinese
Broadcast News Parallel Text - Part 3 is the last of the three-part GALE
Phase 1 Chinese Broadcast News Parallel Text release, which, along with other
corpora, was used as training data in year 1 (Phase 1) of the
DARPA-funded GALE program. LDC has previously released GALE Phase 1
Chinese Broadcast News Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T23>
and GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08>.
A total of 19.1 hours of Chinese broadcast news recordings were selected
from three sources: VOA, CCTV (a broadcaster from Mainland China) and
Phoenix TV (a Hong Kong-based satellite TV station). A manual selection
procedure was used to choose data appropriate for the GALE program,
namely, news programs focusing on current events. Stories on topics such
as sports, entertainment and business were excluded from the data set.
Manual sentence unit/segment (SU) annotation was also performed on a
subset of files following LDC's Quick Rich Transcription specification.
Three types of end-of-sentence SU were identified: statement SU,
question SU, and incomplete SU.
After transcription and SU annotation, the files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names and
speech disfluencies), and quality control procedures applied to
completed translations.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu