[Corpora-List] New from the LDC

Tue May 29 17:39:25 UTC 2007

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications.

LDC2007S08
CSLU: Foreign Accented English Release 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S08>

LDC2007T07
English Gigaword Third Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07>

LDC2007T21
OntoNotes v 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21>

------------------------------------------------------------------------

New Publications

(1)  CSLU: Foreign Accented English Release 1.2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S08> 
consists of continuous speech in English by native speakers of 22 
different languages: Arabic, Cantonese, Czech, Farsi, French, German, 
Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Mandarin 
Chinese, Malay, Polish, Portuguese (Brazilian and Iberian), Russian, 
Swedish, Spanish, Swahili, Tamil and Vietnamese. The corpus contains 
4925 telephone-quality utterances, as well as information about the 
speakers' linguistic backgrounds and perceptual judgments about the 
accents in the utterances.

For this collection, the speakers were asked to speak about themselves 
in English for 20 seconds. Three native speakers of American English 
independently listened to each utterance and judged the speakers' 
accents on a 4-point scale: negligible/no accent, mild accent, strong 
accent and very strong accent. The CSLU: Foreign Accented English Release 
1.2 corpus is intended to support the study of the underlying 
characteristics of foreign accent and to enable research, development 
and evaluation of algorithms for the identification and understanding of 
accented speech. Some of the files in this corpus are also contained in 
CSLU: 22 Languages Corpus, LDC2005S26. 

(2)  English Gigaword Third Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07> 
is a comprehensive archive of newswire text data that has been acquired 
over several years by the LDC.  This edition includes all of the 
contents of the second edition (LDC2005T12) as well as new data from the 
same five sources presented there, covering the 24-month period from January 
2005 through December 2006. Also, a sixth data source (the Los Angeles 
Times/Washington Post newswire service) has been added in the third 
edition.

The six distinct international sources of English newswire included in 
this edition are the following:

    * Agence France-Presse, English Service (afp_eng)
    * Associated Press Worldstream, English Service (apw_eng)
    * Central News Agency of Taiwan, English Service (cna_eng)
    * Los Angeles Times/Washington Post Newswire Service (ltw_eng)
    * New York Times Newswire Service (nyt_eng)
    * Xinhua News Agency, English Service (xin_eng)
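
For scripts that process the corpus, the six source identifiers listed 
above can be kept in a small lookup table. The Python sketch below is 
only an illustration: the function name `source_name` and the idea of 
matching on the leading seven characters of a longer identifier are 
assumptions, not something taken from the LDC release documentation. 
Only the identifiers and source names come from the list above.

```python
# Source identifiers and names exactly as listed in this announcement.
GIGAWORD_SOURCES = {
    "afp_eng": "Agence France-Presse, English Service",
    "apw_eng": "Associated Press Worldstream, English Service",
    "cna_eng": "Central News Agency of Taiwan, English Service",
    "ltw_eng": "Los Angeles Times/Washington Post Newswire Service",
    "nyt_eng": "New York Times Newswire Service",
    "xin_eng": "Xinhua News Agency, English Service",
}


def source_name(identifier: str) -> str:
    """Return the source name for a string that starts with a source ID.

    Hypothetical helper: matching is case-insensitive and looks only at
    the first seven characters; the layout of any remainder is assumed.
    """
    key = identifier[:7].lower()
    if key not in GIGAWORD_SOURCES:
        raise KeyError(f"unknown source identifier: {identifier!r}")
    return GIGAWORD_SOURCES[key]
```

Calling `source_name("nyt_eng")` returns the New York Times entry, and 
an identifier with an unknown prefix raises `KeyError` rather than 
being silently skipped.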

As with other Gigaword releases, some of the content in this corpus 
has been published previously by the LDC in a variety of other corpora, 
particularly the North American News text corpora, the various TDT 
corpora, and the AQUAINT text corpus, as well as earlier editions of 
English Gigaword.

(3)  The OntoNotes project is a collaborative effort between BBN 
Technologies, the University of Colorado, the University of 
Pennsylvania, and the University of Southern California's Information 
Sciences Institute.  It aims to annotate a large corpus comprising 
various genres of text (news, conversational telephone speech, weblogs, 
Usenet, broadcast, talk shows) in three languages (English, Chinese, 
and Arabic) with structural information (syntax and predicate-argument 
structure) and shallow semantics (word sense linked to an ontology and 
coreference). OntoNotes builds on two time-tested resources, following 
the Penn Treebank for syntax and the Penn PropBank for 
predicate-argument structure. Its semantic representation includes word 
sense disambiguation for nouns and verbs, with each word sense connected 
to an ontology, and coreference. The goals call for annotation of over a 
million words each of English and Chinese, and half a million words of 
Arabic over five years. 

The current release, OntoNotes v 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21>, 
contains over 300K words of annotated English data drawn from the 
LDC95T7 Treebank-2 Wall Street Journal corpus and 250K words of 
annotated Chinese data drawn from Xinhua News Agency and Sinorama Magazine.

The normative version of OntoNotes v 1.0 is a relational database, in 
which the various layers of annotation for both the English and Chinese 
corpora are merged.  It was created by loading the separate Treebank, 
PropBank, word sense, and coreference sources and merging them into a 
set of linked relational database tables.  The source files for each of 
the five layers of annotation (syntactic structure, propositional 
structure, word sense, coreference, and names) are included in the data 
directory, with a separate file for each layer of annotation of each 
corpus document. 
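
The idea of merging separate annotation layers into linked relational 
tables can be illustrated with a small Python sketch using the standard 
`sqlite3` module. Everything in it is hypothetical: the table names, 
columns, the two-word sample document, and the sense label are invented 
for illustration and do not reflect the actual OntoNotes schema or data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one table per annotation layer.

    Hypothetical schema: each layer table points back to a shared token
    table, so the layers can be joined on token_id.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE token (
            token_id INTEGER PRIMARY KEY,
            doc_id   TEXT,
            position INTEGER,
            word     TEXT
        );
        CREATE TABLE word_sense (
            token_id INTEGER REFERENCES token(token_id),
            sense    TEXT
        );
        CREATE TABLE coreference (
            token_id INTEGER REFERENCES token(token_id),
            chain_id INTEGER
        );
        """
    )
    # Invented sample rows, not taken from the corpus.
    conn.executemany(
        "INSERT INTO token VALUES (?, ?, ?, ?)",
        [(1, "demo_doc", 0, "She"), (2, "demo_doc", 1, "runs")],
    )
    conn.execute("INSERT INTO word_sense VALUES (?, ?)", (2, "run.hypothetical_sense_1"))
    conn.execute("INSERT INTO coreference VALUES (?, ?)", (1, 7))
    return conn


def merged_view(conn: sqlite3.Connection) -> list:
    """Join the layers so each token row carries all of its annotations."""
    return conn.execute(
        """
        SELECT t.word, s.sense, c.chain_id
        FROM token t
        LEFT JOIN word_sense s ON s.token_id = t.token_id
        LEFT JOIN coreference c ON c.token_id = t.token_id
        ORDER BY t.position
        """
    ).fetchall()
```

The `LEFT JOIN`s keep every token in the result even when a given layer 
has no annotation for it, which is one plausible way to query merged 
layers side by side.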

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------


Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
