[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Dec 12 18:18:46 UTC 2005


**  New LDC Online Membership!  **

LDC2005S26
**  CSLU:  22 Languages Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S26>  **

LDC2005T34
**  Chinese <-> English Name Entity Lists (v1.0) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34>  **

LDC2005S30
**  The West Point Company G3 American English Speech Data Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S30>  **

The Linguistic Data Consortium (LDC) would like to announce a new 
membership option, the LDC Online Membership, and provide information 
regarding our new publications.

------------------------------------------------------------------------

*LDC Online Membership*

The Linguistic Data Consortium is pleased to announce the LDC Online 
Membership, which is now available for the 2006 Membership year.  LDC 
Online contains a continuously growing, indexed collection of Arabic, 
Chinese and English newswire text, millions of words of English 
telephone speech from the Switchboard and Fisher collections and the 
American English Spoken Lexicon, as well as the full text of the Brown 
corpus.  With LDC Online, users can search textual data and play audio 
extracts for transcribed utterances on standard web browsers.  LDC will 
continue to add new material to LDC Online, including Spanish, Arabic, 
and Chinese conversational telephone data in 2006.
 
The LDC Online Membership is a reduced-cost alternative that provides 
interactive access to a growing subset of LDC data for users who do not 
need linguistic data on media.  Current LDC members already 
have access to all LDC Online resources. The LDC Online Membership is 
available to Non-Profit and U.S. government organizations for $1,000 
(USD) per calendar year (January to December).  The obligations and data 
usage restrictions of the LDC Online Membership are contained in the LDC 
Online Membership Agreement 
<http://www.ldc.upenn.edu/Membership/Agreements/LDCOnline.Agrmnt.new.htm>.

We invite you to try LDC Online if you have not already done so. Please 
go to http://online.ldc.upenn.edu for a free, limited demonstration and 
to sign up for a non-member LDC Online account.  To become an LDC Online 
member or to request additional information, contact the LDC Membership 
Department at ldc at ldc.upenn.edu.  

We hope that the LDC Online Membership will enhance your linguistic 
research and your association with the LDC.


*New Publications*

(1) The CSLU:  22 Languages Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S26> 
was produced by the Center for Spoken Language Understanding at Oregon 
Health & Science University.  The corpus consists of telephone speech 
from the following languages:  Arabic, Cantonese, Czech, Farsi, German, 
Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish, 
Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and 
English. The corpus contains fixed vocabulary utterances (e.g. days of 
the week) as well as fluent continuous speech. Each of the 50191 
utterances is verified by a native speaker to determine if the caller 
followed instructions when answering the prompts. For this release, 
approximately 19758 utterances have corresponding orthographic 
transcriptions. 

*

(2) Chinese <-> English Name Entity Lists (v1.0) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34> 
are compiled from Xinhua News Agency articles. This release consists of 
9 pairs of bi-directional lists in the following categories: Person 
Names,  Place Names, Organization Names, Industry Names, Press Names, 
Other Names, and Who is Who Names. The English->Chinese version of each 
pair was created by reversing the Chinese->English version; both 
versions were sorted with the Unix built-in sort function. 
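
The reversal described above can be sketched in a few lines of Python. 
This is only an illustration, not the procedure LDC used: the 
tab-separated "source<TAB>target" line layout and the function name 
reverse_pairs are assumptions, and Python's code-point ordering stands 
in for the Unix sort utility, whose output depends on locale and 
encoding.

```python
def reverse_pairs(lines, sep="\t"):
    """Swap the two fields of each 'source<sep>target' line, then sort.

    The tab-separated layout is an assumption made for this sketch;
    the released lists may use a different format or encoding.
    """
    reversed_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        source, target = line.split(sep, 1)
        reversed_lines.append(f"{target}{sep}{source}")
    # sorted() orders strings by code point, which may differ from the
    # order produced by Unix sort under a given locale.
    return sorted(reversed_lines)


chinese_to_english = ["北京\tBeijing", "上海\tShanghai"]
english_to_chinese = reverse_pairs(chinese_to_english)
```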

*

(3) The West Point Company G3 American English Speech Data Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S30> 
was produced by the Center for Technology Enhanced Language Learning, 
part of the U.S. Military Academy's Department of Foreign Languages. During 
the 2000-2001 academic year, cadets, staff and faculty members at the 
United States Military Academy volunteered to participate in a speech 
data collection project for American English. The goal of the project 
was to amass recordings from no fewer than one hundred adult speakers, 
fifty males and fifty females, to form a substantial corpus of 
high-quality read speech.

The 185 sentences comprising the data collection script were written to 
elicit examples of all or nearly all of the possible syllables used in 
spoken American English.  The G3 Corpus audio data comes from 53 female 
and 56 male volunteers, each of whom recorded approximately 104 
utterances. The recordings are sampled at 16-bit resolution and 22,050 
samples per second. Recordings were made using headset microphones 
(Shure M10) with preamplifiers attached to the line input jack of 
desktop computers. The total amount of speech is about 15 hours. 
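
As a rough sanity check on the figures above, the following Python 
sketch derives the approximate utterance count, mean utterance length 
and uncompressed audio size. Single-channel (mono) recording is an 
assumption not stated in the announcement, and the inputs are the 
approximate values quoted above, so the results are estimates only.

```python
SAMPLE_RATE_HZ = 22_050        # samples per second (from the announcement)
BYTES_PER_SAMPLE = 2           # 16-bit resolution
CHANNELS = 1                   # assumption: mono recordings
TOTAL_HOURS = 15               # "about 15 hours"
SPEAKERS = 53 + 56             # female + male volunteers
UTTERANCES_PER_SPEAKER = 104   # "approximately 104"

total_seconds = TOTAL_HOURS * 3600
total_utterances = SPEAKERS * UTTERANCES_PER_SPEAKER
mean_utterance_seconds = total_seconds / total_utterances
raw_bytes = total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

print(f"utterances (approx.): {total_utterances}")
print(f"mean utterance length: {mean_utterance_seconds:.1f} s")
print(f"uncompressed audio: {raw_bytes / 1e9:.2f} GB")
```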


------------------------------------------------------------------------


If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573 
1275.


--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu
