[Corpora-List] New Data from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Sep 22 21:19:16 UTC 2008


*-  Release of Additional NomBank Files  -*

*-  Switchboard Dialog Act Corpus Now Available  -*

LDC2008T17
*-  CALLHOME Mandarin Chinese Transcripts - XML Version 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T17>  -*

LDC2008S07
*-  CSLU: ISOLET Spoken Letter Database Version 1.3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S07>  -*

LDC2008T18
*-  GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T18>  -*

------------------------------------------------------------------------

*Release of Additional NomBank Files*

NomBank is an annotation project at New York University which provides 
argument structure for instances of common nouns in Treebank-2 (LDC95T7) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7>  and 
Treebank-3 (LDC99T42) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>, 
also known as the 'Penn Treebanks'.  Last December, the project released 
NomBank.1.0, which covers all the "markable" nouns in the Wall Street 
Journal material in the Penn Treebanks.   That release included a total 
of 114,576 propositions derived from looking at a total of 202,965 noun 
instances and choosing only those nouns whose arguments occur in the 
text.  NomBank and related resources are available from the NomBank 
<http://nlp.cs.nyu.edu/meyers/NomBank.html> project website.

The LDC is now making available additional NomBank data which have been 
restricted due to licensing arrangements with their owners. Those files 
are as follows:

    * *NomBank v 1.0 (LDC2008T23)*
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23>
          o a complete printout of NomBank in human-readable form.


    A license to either Treebank-2 (LDC95T7)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7>
    or Treebank-3 (LDC99T42)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>
    is required to obtain NomBank v 1.0.

    * *COMNOM v 1.0 (LDC2008T24)*
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T24>
          o COMNOM is created by automatically adding classes to COMLEX
            Syntax on the basis of NOMLEX-PLUS.  For details, please see
            the document entitled "Those Other NomBank Dictionaries
            <http://nlp.cs.nyu.edu/meyers/nombank/those-other-nombank-dictionaries.pdf>".


    A license to COMLEX English Syntax Lexicon (LDC98L21)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98L21>
    or COMLEX Syntax Text Corpus Version 2.0 (LDC96T11)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T11>
    is required to obtain COMNOM v 1.0.

 

All requests for these files can be directed to ldc at ldc.upenn.edu 
<mailto:ldc at ldc.upenn.edu>


*Switchboard Dialog Act Corpus Now Available*

The Switchboard Dialog Act Corpus is a version of the Switchboard-1 
Release 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62> 
corpus of telephone conversations tagged with a shallow discourse 
tagset  of approximately 60 basic dialog act tags and combinations.  The 
discourse tag-set used is an augmentation of the Discourse Annotation 
and Markup System of Labeling (DAMSL) tag-set, and is referred to as the 
'SWBD-DAMSL' labels. These annotations were created in 1997 at the 
University of Colorado at Boulder, with the goal of building better 
language models for automatic speech recognition of the Switchboard 
domain. To that end, the label set incorporates both traditional 
sociolinguistic and discourse-theoretic rhetorical 
relations/adjacency pairs and some more form-based labels. The 
Switchboard Dialog Act Corpus contains labels for 1,155 five-minute 
conversations, comprising 205,000 utterances and 1.4 million words. 

To download this corpus from our ftp server, please visit the LDC 
catalog page for Switchboard-1 Release 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62> 
and scroll down to the section entitled 'Updates'.



*New Publications*

(1) LDC's CALLHOME Mandarin Chinese collection includes telephone 
speech, associated transcripts and a lexicon. CALLHOME Mandarin Chinese 
Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S34> 
consists of 120 unscripted telephone conversations between native 
speakers of Mandarin Chinese. All calls, which lasted up to thirty 
minutes, originated in North America and were placed to locations 
overseas; most participants called family members or close friends. 
CALLHOME Mandarin Chinese Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T16> 
covers a contiguous five- or ten-minute segment from each of the 
telephone speech files. The transcripts are in tab-delimited format with 
GB2312 encoding, are timestamped by speaker turn for alignment with the 
speech signal and are provided in standard orthography. CALLHOME 
Mandarin Chinese Lexicon 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96T16> 
comprises over 40,000 words from twenty CALLHOME Mandarin 
transcripts.
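
For readers working with the original tab-delimited transcripts, the 
following is a minimal Python sketch of decoding GB2312 text and 
re-encoding it as UTF-8. The four-column layout (start time, end time, 
speaker, text), the Turn type and the function names are assumptions 
made for illustration only; they are not taken from the corpus 
documentation, which should be consulted for the actual layout.

```python
from typing import Iterator, NamedTuple


class Turn(NamedTuple):
    """One speaker turn. The field layout is an assumption, not the LDC spec."""
    start: float   # assumed: turn start time in seconds
    end: float     # assumed: turn end time in seconds
    speaker: str   # assumed: speaker label
    text: str      # transcript text in standard orthography


def read_turns(raw: bytes, encoding: str = "gb2312") -> Iterator[Turn]:
    """Decode transcript bytes and yield one Turn per well-formed line."""
    for line in raw.decode(encoding).splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip blank lines and lines that do not fit the assumed layout
        start, end, speaker, text = fields
        yield Turn(float(start), float(end), speaker, text)


def to_utf8(raw: bytes, encoding: str = "gb2312") -> bytes:
    """Re-encode transcript bytes from the source encoding to UTF-8."""
    return raw.decode(encoding).encode("utf-8")
```

Decoding to Python strings first keeps the timestamps and text 
independent of the byte encoding, so the same turns can be written back 
out in UTF-8 or compared against the XML version.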

CALLHOME Mandarin Chinese Transcripts - XML Version 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T17>, 
the latest addition to this collection, was created by Lancaster 
University and presents the entire original corpus of 120 transcripts in 
XML format with UTF-8 encoding, retokenization and part-of-speech (POS) 
tagging. The retokenization and POS information were supplied using the 
Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of 
Computing Technology, Chinese Academy of Sciences 
<http://www.ict.ac.cn/english/>, Beijing. ICTCLAS aims to incorporate 
Chinese word segmentation, POS tagging, disambiguation and unknown-word 
recognition into a single theoretical framework using multi-layered 
hierarchical hidden Markov models.

In addition to the original applications for Mandarin Chinese CALLHOME 
data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - 
XML Version will be useful in the grammatical study of spoken Mandarin.  
This XML corpus retains all of the linguistic analyses (e.g., 
timestamps, spoken features and proper nouns) from the original 
transcripts release, but the mnemonics used in the original release were 
migrated into XML markup.

All analyses in the original release were retained at the expense of 
tokenization and part-of-speech tagging accuracy (e.g., some mnemonics 
encoding spoken features may split a word, which can affect the tagging 
accuracy). However, the results of the automated processing were 
substantially post-edited.  In addition, a large number of obvious 
typographical errors in the original release were corrected in the 
process of post-editing. 

 


(2) CSLU: ISOLET Spoken Letter Database Version 1.3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S07> 
was created by the Center for Spoken Language Understanding (CSLU) at 
OGI School of Science and Engineering, Oregon Health and Science 
University, Beaverton, Oregon.  CSLU: ISOLET Spoken Letter Database 
Version 1.3 is a database of letters of the English alphabet spoken in 
isolation under quiet laboratory conditions and associated transcripts. 
The data were collected in 1990 and consist of two productions of each 
letter by 150 speakers (7,800 spoken letters), for approximately 1.25 
hours of speech. The subjects consisted of 75 male speakers and 75 
female speakers; all speakers reported English as their native language. 
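
The token count above follows directly from the collection design, and 
a short Python check makes the arithmetic explicit. The per-letter 
average is only a rough figure derived from the approximate 1.25-hour 
total; the variable names are illustrative.

```python
LETTERS = 26                 # letters of the English alphabet
PRODUCTIONS_PER_LETTER = 2   # each speaker said every letter twice
SPEAKERS = 150
TOTAL_HOURS = 1.25           # approximate total, as stated in the description

total_tokens = LETTERS * PRODUCTIONS_PER_LETTER * SPEAKERS
mean_seconds_per_token = TOTAL_HOURS * 3600 / total_tokens

print(total_tokens)                      # 7800
print(round(mean_seconds_per_token, 2))  # 0.58
```

On these figures each recorded letter accounts for a little over half a 
second of audio on average.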

Speech was recorded in the OGI speech recognition laboratory and the 
recording equipment was selected to mimic the equipment used to collect 
the TIMIT 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1> 
database as closely as possible. The speech was recorded with a 
Sennheiser HMD 224 noise-canceling microphone, low pass filtered at 7.6 
kHz. Data capture was performed using the AT&T DSP32 board installed in 
a Sun 4/110. The data were sampled at 16 kHz and converted to RIFF (.WAV) 
format.
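
As a quick sanity check on a downloaded file, the sampling rate can be 
read from the WAV header with Python's standard wave module. This is a 
minimal sketch that assumes the files are uncompressed PCM, which is 
all the wave module reads; the function name is hypothetical, and the 
announcement does not state the channel count or sample width, so 
those values are simply reported rather than checked.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic header information for an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }
```

For a file from this release, the "sample_rate_hz" entry would be 
expected to read 16000, matching the 16 kHz sampling described above.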

The transcriptions of the recorded speech are time-aligned phonetic 
transcriptions conforming to the CSLU Labeling standards. Time-aligned 
word transcriptions are represented in a standard orthography or 
romanization. Speech and non-speech phenomena are distinguished. The 
transcriptions are aligned to a waveform by placing boundaries to mark 
the beginning and ending of words. In addition to the specification of 
boundaries, this level of transcription includes additional commentary 
on salient speech and non-speech characteristics, such as 
glottalization, inhalation, and exhalation. 

 



(3) GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T18> 
contains transcripts and English translations of 19.1 hours of Chinese 
broadcast news programming from Voice of America (VOA), China Central TV 
(CCTV) and Phoenix TV. It does not contain the audio files from which 
the transcripts and translations were generated. GALE Phase 1 Chinese 
Broadcast News Parallel Text - Part 3 is the last part of the 
three-part GALE Phase 1 Chinese Broadcast News Parallel Text, which, 
along with other 
corpora, was used as training data in year 1 (Phase 1) of the 
DARPA-funded GALE program. LDC has previously released GALE Phase 1 
Chinese Broadcast News Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T23> 
and GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08>.

A total of 19.1 hours of Chinese broadcast news recordings were selected 
from three sources: VOA,  CCTV (a broadcaster from Mainland China) and 
Phoenix TV (a Hong Kong-based satellite TV station).  A manual selection 
procedure was used to choose data appropriate for the GALE program, 
namely, news programs focusing on current events. Stories on topics such 
as sports, entertainment and business were excluded from the data set. 
Manual sentence unit/segment (SU) annotation was also performed on a 
subset of files following LDC's Quick Rich Transcription specification. 
Three types of end-of-sentence SU were identified: statement SU, 
question SU, and incomplete SU.

After transcription and SU annotation, the files were reformatted into 
a human-readable translation format and then assigned to 
professional translators for careful translation. Translators followed 
LDC's GALE Translation guidelines, which describe the makeup of the 
translation team, the source data format, the translation data format, 
best practices for translating certain linguistic features (such as 
names and speech disfluencies), and quality control procedures applied 
to completed translations.



------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
