[Corpora-List] New Corpora from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Apr 28 20:56:51 UTC 2006


LDC2006S16
*CSLU Spoltech Brazilian Portuguese Version 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S16>*

LDC2006T09
*Korean Treebank Annotations Version 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T09>*

LDC2006S13
*N4 NATO Native and Non-Native Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S13>*

LDC2006T08
*TimeBank 1.2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>*


The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of four new publications.

------------------------------------------------------------------------

*New LDC Publications*

(1)  The CSLU Spoltech Brazilian Portuguese 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S16> 
corpus contains microphone speech from a variety of regions in Brazil 
with phonetic and orthographic transcriptions. The utterances consist of 
both read speech (for phonetic coverage) and responses to questions (for 
spontaneous speech). The corpus contains 477 speakers and 8080 separate 
utterances. A total of 2540 utterances have been transcribed at the word 
level (without time alignments), and 5479 utterances have been 
transcribed at the phoneme level (with time alignments).

The data were recorded at 44.1 kHz (mono, 16 bit) and stored in 
RIFF format, with the microphone connected directly to a 
SoundBlaster-compatible sound card. For the prompted sentences, the sentence was 
hidden from view when recording began, so that the speaker might utter 
the sentence more naturally. Verification of the recording quality was 
performed immediately after each utterance recording; the 
data-collection software allowed the speaker to re-record utterances in 
case the recording was not of sufficient quality. The acoustic 
environment was not controlled, in order to allow for background 
conditions that would occur in application environments. 
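
As a rough illustration of the recording format described above, the following Python sketch checks whether an audio file is mono, 16-bit, 44.1 kHz. It assumes the corpus files are standard RIFF/WAVE PCM files readable by Python's standard wave module, which this announcement does not confirm. The helper names and the synthetic test clip are hypothetical and are not part of the corpus.

```python
import io
import wave

# Format stated in the announcement: mono, 16 bit (2 bytes), 44.1 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 44100}


def describe_wav(source):
    """Return the basic PCM parameters of a WAVE file (path or file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }


def matches_announced_format(source):
    """True if the file is mono, 16-bit, 44.1 kHz."""
    return describe_wav(source) == EXPECTED


def make_silent_wav(seconds=0.1):
    """Build a short silent clip in memory (hypothetical stand-in for a corpus file)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(44100)
        wav.writeframes(b"\x00\x00" * int(44100 * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
```

To check a real file, a path can be passed in place of the in-memory clip, for example matches_announced_format("utterance.wav"), where the file name is a placeholder.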


(2)  The Korean Treebank Annotations Version 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T09> 
is an extension of the Korean English Treebank Annotations corpus, 
LDC2002T26 (2002). It is essentially an electronic corpus of Korean 
texts annotated with morphological and syntactic information. The 
original texts for the Korean Treebank 2.0 were selected from The Korean 
Newswire corpus published by LDC, catalog number LDC2000T45, which is a 
collection of Korean Press Agency news articles from June 2, 1994 to 
March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion 
of the corpus and includes 647 articles. The annotated corpus has 
many uses, including training morphological analyzers, part-of-speech 
taggers and syntactic parsers. 


(3)  The N4 NATO Native and Non-Native Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S13> 
corpus was developed by the NATO research group on Speech and Language 
Technology in order to provide a military-oriented database for 
multilingual and non-native speech processing studies.  The NATO Speech 
and Language Technology group decided to create a corpus geared towards 
the study of non-native accents. The group chose naval communications as 
the common task because it naturally includes a great deal of non-native 
speech and because there were training facilities where data could be 
collected in several countries.

Speech data was recorded in the naval transmission training centers of 
four countries (Germany, the Netherlands, the United Kingdom, and Canada). 
The material consists of native and non-native speakers using 
NATO English procedure between ships and reading from a text, "The North 
Wind and the Sun", in both English and the speaker's native language.


(4) The TimeBank 1.2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08> 
corpus contains 183 news articles that have been annotated with temporal 
information, adding events, times and temporal links between events and 
times. The annotation follows the TimeML 1.2.1 specification.  The most 
recent information on TimeML is always available at www.timeml.org 
<http://www.timeml.org>.

TimeML aims to capture and represent temporal information. This is 
accomplished using four primary tag types: TIMEX3 for temporal 
expressions, EVENT for temporal events, SIGNAL for temporal signals, and 
LINK for representing relationships.  TimeBank 1.2 is distributed via 
web download.

Nonmembers may also license this data at *no cost* - please note that a 
signed copy of our generic nonmember user agreement 
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/generic.license.html> is 
required.
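
For readers who want a feel for the four TimeML tag types mentioned above, here is a minimal Python sketch that counts them in a TimeML-style document. The sample fragment is invented for illustration and is not taken from TimeBank 1.2; its attributes are simplified. Grouping every element whose name ends in LINK (such as TLINK) under a single LINK heading is an assumption made here to match the wording of the announcement.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified TimeML-style fragment (not from TimeBank 1.2).
SAMPLE = """<TimeML>
On <TIMEX3 tid="t1" type="DATE">Friday</TIMEX3> the consortium
<EVENT eid="e1">announced</EVENT> four corpora
<SIGNAL sid="s1">after</SIGNAL> a <EVENT eid="e2">review</EVENT>.
<TLINK lid="l1" relType="IS_INCLUDED"/>
</TimeML>"""


def count_tags(xml_text):
    """Count TIMEX3, EVENT, SIGNAL and link elements in a TimeML-style string."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        if elem.tag in ("TIMEX3", "EVENT", "SIGNAL"):
            counts[elem.tag] += 1
        elif elem.tag.endswith("LINK"):
            # Assumption: all link elements are grouped under one heading.
            counts["LINK"] += 1
    return counts


if __name__ == "__main__":
    for tag, n in sorted(count_tags(SAMPLE).items()):
        print(tag, n)
```

The same function could be applied to the text of a corpus file read from disk, assuming the file is well-formed XML.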


------------------------------------------------------------------------

If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573 
1275.



--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu

