[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue May 29 17:39:25 UTC 2007
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.
LDC2007S08
*-CSLU: Foreign Accented English Release 1.2-*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S08>
LDC2007T07
*-English Gigaword Third Edition-*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07>
LDC2007T21
*-OntoNotes v 1.0-*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21>
------------------------------------------------------------------------
New Publications

(1) CSLU: Foreign Accented English Release 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S08>
consists of continuous speech in English by native speakers of 22
different languages: Arabic, Cantonese, Czech, Farsi, French, German,
Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Mandarin
Chinese, Malay, Polish, Portuguese (Brazilian and Iberian), Russian,
Swedish, Spanish, Swahili, Tamil and Vietnamese. The corpus contains
4925 telephone-quality utterances, as well as information about the
speakers' linguistic backgrounds and perceptual judgments about the
accents in the utterances.
For this collection, the speakers were asked to speak about themselves
in English for 20 seconds. Three native speakers of American English
independently listened to each utterance and judged the speakers'
accents on a 4-point scale: negligible/no accent, mild accent, strong
accent and very strong accent. The CSLU: Foreign Accented English Release
1.2 corpus is intended to support the study of the underlying
characteristics of foreign accent and to enable research, development
and evaluation of algorithms for the identification and understanding of
accented speech. Some of the files in this corpus are also contained in
CSLU: 22 Languages Corpus, LDC2005S26.
(2) English Gigaword Third Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07>
is a comprehensive archive of newswire text data that has been acquired
over several years by the LDC. This edition includes all of the
contents in the second edition (LDC2005T12) as well as new data from the
same five sources presented there, covering the 24-month period from January
2005 through December 2006. Also, a sixth data source (the Los Angeles
Times/Washington Post newswire service) has been added in the third
edition.
The six distinct international sources of English newswire included in
this edition are the following:
* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)
As with other Gigaword releases, some of the content in this corpus
has been published previously by the LDC in a variety of other corpora,
particularly the North American News text corpora, the various TDT
corpora, and the AQUAINT text corpus, as well as earlier editions of
English Gigaword.
(3) The OntoNotes project is a collaborative effort between BBN
Technologies, the University of Colorado, the University of
Pennsylvania, and the University of Southern California's Information
Sciences Institute. It aims to annotate a large corpus comprising
various genres of text (news, conversational telephone speech, weblogs,
Usenet, broadcast, talk shows) in three languages (English, Chinese,
and Arabic) with structural information (syntax and predicate-argument
structure) and shallow semantics (word sense linked to an ontology and
coreference). OntoNotes builds on two time-tested resources, following
the Penn Treebank for syntax and the Penn PropBank for
predicate-argument structure. Its semantic representation includes word
sense disambiguation for nouns and verbs, with each word sense connected
to an ontology, and coreference. The goals call for annotation of over a
million words each of English and Chinese, and half a million words of
Arabic over five years.
The current release, OntoNotes v 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21>,
contains over 300K words of annotated English data drawn from the
LDC95T7 Treebank-2 Wall Street Journal corpus and 250K words of
annotated Chinese data drawn from Xinhua News Agency and Sinorama Magazine.
The normative version of OntoNotes v 1.0 is a relational database, in
which the various layers of annotation for both the English and Chinese
corpora are merged. It was created by loading the separate Treebank,
PropBank, word sense, and coreference sources and merging them into a
set of linked relational database tables. The source files for each of
the five layers of annotation (syntactic structure, propositional
structure, word sense, coreference, and names) are included in the data
directory, with a separate file for each layer of annotation of each
corpus document.
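For readers unfamiliar with this kind of layout, the idea of merging separate
annotation layers into linked relational tables can be sketched roughly as
follows. This is only an illustration under stated assumptions: the table
names, column names, sense labels, and example sentence are all hypothetical
and are not taken from the actual OntoNotes v 1.0 database schema.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT the actual OntoNotes tables.
# Each annotation layer lives in its own table and links back to a token.
SCHEMA = """
CREATE TABLE token (
    token_id INTEGER PRIMARY KEY,
    doc_id   TEXT NOT NULL,
    position INTEGER NOT NULL,
    word     TEXT NOT NULL
);
CREATE TABLE syntax      (token_id INTEGER REFERENCES token(token_id), pos_tag TEXT);
CREATE TABLE word_sense  (token_id INTEGER REFERENCES token(token_id), sense TEXT);
CREATE TABLE coreference (token_id INTEGER REFERENCES token(token_id), chain_id INTEGER);
"""


def build_demo_db():
    """Load four made-up tokens and their made-up annotation layers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    words = ["The", "company", "raised", "prices"]
    conn.executemany(
        "INSERT INTO token (token_id, doc_id, position, word) VALUES (?, ?, ?, ?)",
        [(i, "demo_doc", i, w) for i, w in enumerate(words, start=1)],
    )
    conn.executemany(
        "INSERT INTO syntax VALUES (?, ?)",
        [(1, "DT"), (2, "NN"), (3, "VBD"), (4, "NNS")],
    )
    # Sense labels below are invented placeholders, not real inventory entries.
    conn.executemany(
        "INSERT INTO word_sense VALUES (?, ?)",
        [(2, "company-sense-1"), (3, "raise-sense-2")],
    )
    conn.executemany("INSERT INTO coreference VALUES (?, ?)", [(2, 1)])
    conn.commit()
    return conn


def merged_view(conn, doc_id):
    """Return one row per token with every layer joined in (None if absent)."""
    query = """
        SELECT t.word, s.pos_tag, w.sense, c.chain_id
        FROM token AS t
        LEFT JOIN syntax      AS s ON s.token_id = t.token_id
        LEFT JOIN word_sense  AS w ON w.token_id = t.token_id
        LEFT JOIN coreference AS c ON c.token_id = t.token_id
        WHERE t.doc_id = ?
        ORDER BY t.position
    """
    return conn.execute(query, (doc_id,)).fetchall()


if __name__ == "__main__":
    for row in merged_view(build_demo_db(), "demo_doc"):
        print(row)
```

The LEFT JOINs keep every token in the result even when a given layer has no
annotation for it, which is one plausible way to view sparse layers such as
word sense and coreference side by side with the syntax layer.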
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu