[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Apr 5 20:29:30 UTC 2006


*Agreement between AsiaNet and LDC*

LDC2006S15
*CSLU:  Spelled and Spoken Words* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S15>

LDC2006T03
*Korean Propbank* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T03>

LDC2006S30
*Speech Controlled Computing* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S30>

The Linguistic Data Consortium (LDC) would like to highlight recent 
developments and announce the availability of three new publications.

------------------------------------------------------------------------

*Agreement between AsiaNet and LDC*


LDC has recently entered into a data license agreement with AsiaNet, a 
consortium of Asia Pacific news agencies headquartered in Australia. 
AsiaNet translates and distributes full text (unedited) press releases 
to all forms of media worldwide through its Asia Pacific agencies and 
affiliates in the US, Canada and Europe. AsiaNet also has the capacity 
to deliver images, audio and video releases.

The LDC/AsiaNet agreement gives LDC access to AsiaNet's multilingual 
texts. LDC is already utilizing AsiaNet's Urdu and Thai texts in the 
Less Commonly Taught Languages (LCTL) project.

LDC and AsiaNet look forward to a long and fruitful association that 
supports language-related education, research and technology 
development. As it strengthens its ties with the LDC and becomes more 
widely known, AsiaNet hopes to attract interest in its services through 
its news agency contacts at http://www.asianetnews.net/home.asp


*New Publications from the LDC*

The CSLU:  Spelled and Spoken Words 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S15> 
corpus consists of spelled and spoken words. 3647 callers were prompted 
to say and spell their first and last names, to say what city they 
grew up in and what city they were calling from, and to answer two 
yes/no questions. In order to collect sufficient instances of each 
letter, 1371 callers also recited the English alphabet with pauses 
between the letters. Each call was transcribed by two people, and all 
differences were resolved. In addition, a subset of 2648 calls has been 
phonetically labeled. 



Korean Propbank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T03> 
is a semantic annotation of the Korean English Treebank Annotations and 
Korean Treebank version 2.0. Each verb and adjective occurring in the 
Treebank has been treated as a semantic predicate and the surrounding 
text has been annotated for arguments and adjuncts of the predicate. The 
verbs and adjectives have also been tagged with coarse-grained senses. 

There are two basic components to Korean Propbank:

    * The Verb Lexicon. A frames file, consisting of one or more frame
      sets, has been created for each predicate occurring in the
      Treebank. These files serve as a reference for the annotators and
      for users of the data. 2,749 such files have been created.
    * The Annotation. There are two annotation files. The
      virginia-verbs.pb file has 9,588 annotated predicate tokens. These
      predicate tokens include all those occurring in over 54 thousand
      words of the Korean English Treebank Annotations, totaling ~791 KB
      of uncompressed data. The newswire-verbs.pb file has 23,707
      annotated predicate tokens. These predicate tokens include all
      those occurring in over 131 thousand words of the Korean Treebank
      version 2.0.



The Speech Controlled Computing 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S30> 
corpus was designed to support the development of small footprint, 
embedded ASR applications in the domain of voice control for the home. 
It consists of the recordings of 125 speakers of American English from 
four regions, three age groups and two gender groups, pronouncing 
isolated words. The recordings were conducted in a sound-attenuated 
room, and a high-quality microphone was used. Each speaker read a 
randomized word list consisting of 2100 words (100 distinct words 
appearing 21 times each).
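
As a rough illustration of the prompt-list arithmetic above (100 distinct 
words, 21 repetitions each, 2100 prompts in randomized order), a list of 
that shape could be built as sketched below. This is only a sketch under 
stated assumptions: the function name build_word_list, the placeholder 
vocabulary and the fixed seed are hypothetical, and the procedure actually 
used to prepare the corpus prompts is not described in this announcement.

```python
import random
from collections import Counter


def build_word_list(vocabulary, repetitions=21, seed=None):
    """Return every vocabulary word `repetitions` times, in shuffled order."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    prompts = [word for word in vocabulary for _ in range(repetitions)]
    rng.shuffle(prompts)       # in-place shuffle of the full prompt list
    return prompts


# Placeholder vocabulary (assumption): the real 100 words are not listed here.
vocabulary = [f"word{i:03d}" for i in range(100)]

prompts = build_word_list(vocabulary, repetitions=21, seed=0)
counts = Counter(prompts)

print(len(prompts))   # total number of prompts per speaker
print(len(counts))    # number of distinct words
```

Passing a different seed per speaker would give each speaker a different 
ordering while keeping the same 100 x 21 composition.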


**NOTE:  Nonmembers may obtain a commercial rights license to Speech 
Controlled Computing for US$7000 by signing the LDC User License 
Agreement for Speech Controlled Computing 
<http://www.ldc.upenn.edu/Catalog/mem_agree/SCC_User_Agreement.htm>.  
For-Profit Membership to the LDC is not required.**

------------------------------------------------------------------------

If you need further information, or would like to inquire about 
membership to the LDC, please email ldc at ldc.upenn.edu or call +1 215 573 
1275.



--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu

