[Corpora-List] New LDC Corpus
LDC Office
ldc at ldc.upenn.edu
Thu Jan 30 21:58:24 UTC 2003
* English Gigaword *
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of the English Gigaword corpus.
English Gigaword is a comprehensive archive of newswire text data
in English that has been acquired over several years by the LDC. The
newswire texts are drawn from four international sources:
Agence France Presse English Service
Associated Press Worldstream English Service
The New York Times Newswire Service
The Xinhua News Agency English Service
English Gigaword is the first LDC publication to be distributed on
DVD. Much of the content in this collection has been published
previously by the LDC in a variety of older corpora, particularly
the North American News text corpora (LDC95T21, LDC98T30), the
various TDT corpora, and the AQUAINT text corpus (LDC2002T31). In
addition to this previously published data, the English Gigaword corpus
contains a significant amount of previously unreleased data,
specifically, all of the Agence France Presse content, the 1995 and
2001 Xinhua content, and portions of NYT and APW dating from February
2001 forward.
All text data are presented in SGML form, using a very simple, minimal
markup structure; all text consists of printable ASCII and whitespace.
The text formatting is consistent across all sources. The English
Gigaword corpus has been fully validated by a standard SGML parser
utility (nsgmls), using a DTD file which is provided as part of this
publication.
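As a rough illustration of the "printable ASCII and whitespace" property
described above, the following Python sketch scans a byte string for any
byte outside that set. It is not part of the LDC distribution; the names
ALLOWED, find_non_ascii and is_clean are hypothetical, and the exact
character set (Python's string.printable) is an assumption about what
"printable ASCII and whitespace" covers.

```python
import string

# Assumption: "printable ASCII and whitespace" is interpreted as Python's
# string.printable (letters, digits, punctuation, space, tab, newline,
# carriage return, vertical tab, form feed).
ALLOWED = frozenset(string.printable.encode("ascii"))


def find_non_ascii(data: bytes):
    """Return (offset, byte value) pairs for bytes outside ALLOWED."""
    return [(i, b) for i, b in enumerate(data) if b not in ALLOWED]


def is_clean(data: bytes) -> bool:
    """True if every byte is printable ASCII or whitespace."""
    return not find_non_ascii(data)


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the corpus.
    print(is_clean(b"Sample newswire text.\n"))
    print(find_non_ascii("caf\u00e9".encode("utf-8")))
```

A check like this only covers the character set; it says nothing about
whether the markup is valid, which is what the nsgmls validation against
the supplied DTD addresses.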
For further information, including a link to online documentation,
please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for $2,500.
If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call (215) 573-1275.
---------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
3600 Market Street              Fax:   (215) 573-2175
Suite 810                       email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2653     www:   http://www.ldc.upenn.edu