[Corpora-List] News from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Apr 5 20:29:30 UTC 2006
*Agreement between AsiaNet and LDC*
LDC2006S15
*CSLU: Spelled and Spoken Words*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S15>
LDC2006T03
*Korean Propbank*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T03>
LDC2006S30
*Speech Controlled Computing*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S30>
The Linguistic Data Consortium (LDC) would like to highlight recent
developments and announce the availability of three new publications.
------------------------------------------------------------------------
*Agreement between AsiaNet and LDC*
LDC has recently entered into a data license agreement with AsiaNet, a
consortium of Asia Pacific news agencies headquartered in Australia.
AsiaNet translates and distributes full text (unedited) press releases
to all forms of media worldwide through its Asia Pacific agencies and
affiliates in the US, Canada and Europe. AsiaNet also has the capacity
to deliver images, audio and video releases.
The LDC/AsiaNet agreement gives LDC access to AsiaNet's multilingual
texts. LDC is already utilizing AsiaNet's Urdu and Thai texts in the
Less Commonly Taught Languages (LCTL) project.
LDC and AsiaNet look forward to a long and fruitful association -
mutually supporting language-related education, research and technology
development. As it strengthens its ties with the LDC and becomes more
widely known, AsiaNet hopes to attract interest in its services through
its news agency contacts at http://www.asianetnews.net/home.asp
*New Publications from the LDC*
The CSLU: Spelled and Spoken Words
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S15>
corpus consists of spelled and spoken words. 3647 callers were prompted
to say and spell their first and last names, to say what city they
grew up in and what city they were calling from, and to answer two
yes/no questions. In order to collect sufficient instances of each
letter, 1371 callers also recited the English alphabet with pauses
between the letters. Each call was transcribed by two people, and all
differences were resolved. In addition, a subset of 2648 calls has been
phonetically labeled.
*
Korean Propbank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T03>
is a semantic annotation of the Korean English Treebank Annotations and
Korean Treebank version 2.0. Each verb and adjective occurring in the
Treebank has been treated as a semantic predicate and the surrounding
text has been annotated for arguments and adjuncts of the predicate. The
verbs and adjectives have also been tagged with coarse-grained senses.
There are two basic components to Korean Propbank:
* The Verb Lexicon. A frames file, consisting of one or more frame
sets, has been created for each predicate occurring in the
Treebank. These files serve as a reference for the annotators and
for users of the data. 2,749 such files have been created.
* The Annotation. There are two annotation files. The
virginia-verbs.pb file has 9,588 annotated predicate tokens. These
predicate tokens include all those occurring in over 54 thousand
words of the Korean English Treebank Annotations, totaling ~791 KB
of uncompressed data. The newswire-verbs.pb file has 23,707
annotated predicate tokens. These predicate tokens include all
those occurring in over 131 thousand words of the Korean Treebank
version 2.0.
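As a quick tally of the figures quoted above, the two annotation files can be summed as follows. This is only an arithmetic sketch: the dictionary layout and field names are illustrative assumptions and do not reflect the format of the .pb files, and the word counts are lower bounds because the text says "over".

```python
# Figures copied from the announcement; the structure and key names
# are hypothetical and are not part of the Korean Propbank release.
annotation_files = {
    "virginia-verbs.pb": {"predicate_tokens": 9588, "min_words": 54_000},
    "newswire-verbs.pb": {"predicate_tokens": 23707, "min_words": 131_000},
}

# Total annotated predicate tokens across both files.
total_predicates = sum(f["predicate_tokens"] for f in annotation_files.values())

# Lower bound on the number of Treebank words covered.
min_total_words = sum(f["min_words"] for f in annotation_files.values())

print(total_predicates)   # 33295
print(min_total_words)    # 185000
```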
*
The Speech Controlled Computing
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S30>
corpus was designed to support the development of small-footprint,
embedded ASR applications in the domain of voice control for the home.
It consists of the recordings of 125 speakers of American English from
four regions, three age groups and two gender groups, pronouncing
isolated words. The recordings were conducted in a sound-attenuated
room, and a high-quality microphone was used. Each speaker read a
randomized word list consisting of 2100 words (100 distinct words
appearing 21 times each).
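The word-list design described above (100 distinct words, each appearing 21 times, in random order) can be sketched as follows. The announcement does not say how the lists were actually randomized, so the shuffling method, the function name build_word_list, and the placeholder vocabulary are all assumptions made for illustration.

```python
import random
from collections import Counter


def build_word_list(vocabulary, repetitions=21, seed=None):
    """Return a shuffled list in which each word appears `repetitions` times.

    Hypothetical helper: a plain uniform shuffle is assumed here, which may
    differ from the procedure used to build the corpus.
    """
    rng = random.Random(seed)
    word_list = [word for word in vocabulary for _ in range(repetitions)]
    rng.shuffle(word_list)
    return word_list


# Placeholder vocabulary; the real 100 words are not listed in the announcement.
vocabulary = [f"word{i:03d}" for i in range(100)]

prompts = build_word_list(vocabulary, repetitions=21, seed=0)
counts = Counter(prompts)

print(len(prompts))           # 2100
print(set(counts.values()))   # {21}
```

Passing a different seed per speaker would give each speaker a differently ordered list with the same contents.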
**NOTE: Nonmembers may obtain a commercial rights license to Speech
Controlled Computing for US$7000 by signing the LDC User License
Agreement for Speech Controlled Computing
<http://www.ldc.upenn.edu/Catalog/mem_agree/SCC_User_Agreement.htm>.
For-Profit Membership to the LDC is not required.**
------------------------------------------------------------------------
If you need further information, or would like to inquire about
membership to the LDC, please email ldc at ldc.upenn.edu or call +1 215 573
1275.
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu