[Corpora-List] New LDC Releases

Tue Jul 1 14:36:13 UTC 2003

     *        LDC2003S01  2001 Communicator Evaluation         *

                 *        LDC2003T10  SAID        *

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new publications.

1.  The 2001 Communicator Evaluation is the second publication to result
from the Communicator program.  The original goals of the Communicator
program were to support the creation of speech-enabled interfaces that
scale gracefully across modalities, from speech-only to interfaces that
include graphics, maps, pointing and gesture. The original vision of the
Communicator systems included the ability of a user, during one
ten-minute session, to plan a three-leg trip, with the three
flights/legs on three different days, with rental car and hotel in each
of the two "away" cities, plus dictating/sending a voice-mail message.

The actual research that led to the data collections in 2000 and 2001
explored ways to construct better spoken-dialogue systems, with which
users interact via speech alone to perform relatively complex tasks such
as travel planning. During 2000 and 2001 two large data sets were
collected, in which users used the Communicator systems built by several
sites to do travel planning.  The 2001 Communicator Evaluation
publication consists of all the data from the 2001 collection.

All audio files have been converted into SPHERE format; there are 53394
SPHERE files, totaling approximately 102 hours of audio. All SPHERE
files are one-channel, 8 kHz, but the sample coding and format, while
consistent for all files belonging to one site, are not consistent across
sites (for example, some sites provided pcm, while others provided ulaw
data). The documentation included in this distribution is replicated
exactly as received from NIST and from the participating sites.  This
publication consists of one DVD.
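
Because the sample coding differs from site to site, it can help to read
each file's SPHERE header before decoding the audio. The following is a
minimal sketch, not part of the distribution: it assumes the standard
SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type
value" fields terminated by "end_head") and the usual field names
sample_coding, sample_rate and channel_count, which have not been checked
against this corpus. The helper make_demo_header is hypothetical and only
builds a synthetic header for demonstration.

```python
import io


def parse_sphere_header(stream):
    """Read the header of a NIST SPHERE file from a binary stream.

    Returns a dict mapping field names to int, float or str values.
    The stream is assumed to be positioned at the start of the file.
    """
    magic = stream.readline().decode("ascii", errors="replace").strip()
    if magic != "NIST_1A":
        raise ValueError("not a SPHERE file: bad magic line")
    header_size = int(stream.readline().decode("ascii").strip())

    # Re-read the whole header block, including the two lines above.
    stream.seek(0)
    block = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in block.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # blank line or comment
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields are typed "-sN", N being the length
            fields[name] = value
    return fields


def make_demo_header(coding):
    """Build a synthetic 1024-byte header (hypothetical, for the demo)."""
    text = (
        "NIST_1A\n   1024\n"
        "channel_count -i 1\n"
        "sample_rate -i 8000\n"
        f"sample_coding -s{len(coding)} {coding}\n"
        "end_head\n"
    )
    return text.ljust(1024).encode("ascii")


if __name__ == "__main__":
    for coding in ("pcm", "ulaw"):
        header = parse_sphere_header(io.BytesIO(make_demo_header(coding)))
        print(header["sample_coding"], header["sample_rate"])
```

With real data, the same function could be applied to a file opened in
binary mode, and the resulting sample_coding values tallied per site.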

For further information, including online documentation, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S01

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this corpus for $900.

2. SAID (A Syntactically Annotated Idiom Dataset) provides data for
investigating the structural configurations in which English idioms are
typically found. The assumption is that, since idioms are phrasal
lexical items (PLIs), they will therefore have structural properties
which are idiosyncratic. In order to study the structural properties of
phrasal lexical items, the data is more useful if it is syntactically
annotated.

The data was originally drawn from four dictionaries of English idioms.
There are 13467 phrasal lexical items in this corpus.  The analysis of
the phrasal lexical items was manual, while the bracketing symmetry was
checked computationally.  SAID is available through FTP download.
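
A computational check of bracketing symmetry can be sketched with a
simple stack. This is an illustration only: the announcement does not
describe the annotation format, so the square-bracket notation, the
function name brackets_balanced and the sample strings below are all
assumptions rather than the procedure the authors used.

```python
def brackets_balanced(text, pairs=(("[", "]"),)):
    """Return True if every opening bracket in text has a matching close.

    pairs lists the (open, close) bracket characters to check; all other
    characters are ignored.
    """
    closers = {close: open_ for open_, close in pairs}
    openers = set(closers.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in closers:
            # A close with no pending open, or the wrong kind of open.
            if not stack or stack.pop() != closers[ch]:
                return False
    # Any bracket still on the stack was never closed.
    return not stack


if __name__ == "__main__":
    # Hypothetical annotations, not taken from the corpus.
    good = "[VP [V kick] [NP [Det the] [N bucket]]]"
    bad = "[VP [V kick] [NP the bucket]"
    print(brackets_balanced(good), brackets_balanced(bad))
```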

This corpus was authored by Koenraad Kuiper, Heather McCann, Heidi
Quinn, Therese Aitchison, and Kees van der Veer under the sponsorship of
the New Zealand Vice Chancellors' Committee and the University of
Canterbury.

For further information, including online documentation, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T10

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this corpus for $200.

                                  *

If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call +1 (215) 573-1275.

--------------------------------------------------------------------
Linguistic Data Consortium          Phone:  +1 (215) 573-1275
University of Pennsylvania          Fax:  +1 (215) 573-2175
3600 Market Street, Suite 810       email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2653         www: http://www.ldc.upenn.edu