14.321, FYI: LSA Bulletin, New LDC Corpus
LINGUIST List
linguist at linguistlist.org
Fri Jan 31 19:53:46 UTC 2003
LINGUIST List: Vol-14-321. Fri Jan 31 2003. ISSN: 1068-4875.
Subject: 14.321, FYI: LSA Bulletin, New LDC Corpus
Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
Reviews (reviews at linguistlist.org):
Simin Karimi, U. of Arizona
Terence Langendoen, U. of Arizona
Home Page: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.
Editor for this issue: James Yuells <james at linguistlist.org>
=================================Directory=================================
1)
Date: Wed, 29 Jan 2003 11:14:48 -0500
From: LSA <lsa at lsadc.org>
Subject: LSA Bulletin
2)
Date: Thu, 30 Jan 2003 17:00:21 -0500
From: LDC Office <ldc at ldc.upenn.edu>
Subject: New LDC Corpus
-------------------------------- Message 1 -------------------------------
Date: Wed, 29 Jan 2003 11:14:48 -0500
From: LSA <lsa at lsadc.org>
Subject: LSA Bulletin
The December 2002 issue of the LSA Bulletin is now available at the
Linguistic Society of America website: http://www.lsadc.org.
-------------------------------- Message 2 -------------------------------
Date: Thu, 30 Jan 2003 17:00:21 -0500
From: LDC Office <ldc at ldc.upenn.edu>
Subject: New LDC Corpus
* English Gigaword *
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of the English Gigaword corpus.
English Gigaword is a comprehensive archive of newswire text data
in English that has been acquired over several years by the LDC. The
newswire texts are drawn from four international sources:
Agence France Presse English Service
Associated Press Worldstream English Service
The New York Times Newswire Service
The Xinhua News Agency English Service
English Gigaword is the first LDC publication to be distributed on
DVD. Much of the content in this collection has been published
previously by the LDC in a variety of older corpora, particularly
the North American News text corpora (LDC95T21, LDC98T30),
the various TDT corpora and the AQUAINT text corpus (LDC2002T31). In
addition to this previously published data, the English Gigaword corpus
contains a significant amount of previously unreleased data,
specifically, all of the Agence France Presse content, the 1995 and
2001 Xinhua content, and portions of NYT and APW dating from February
2001 forward.
All text data are presented in SGML form, using a very simple, minimal
markup structure; all text consists of printable ASCII and whitespace.
The text formatting is consistent across all sources. The English
Gigaword corpus has been fully validated by a standard SGML parser
utility (nsgmls), using a DTD file which is provided as part of this
publication.
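
For readers who want to spot-check the character-set claim on their own copy, the following is a minimal sketch, not part of the LDC distribution. It assumes that "printable ASCII and whitespace" corresponds to Python's string.printable (which also admits vertical tab and form feed); the function name find_disallowed is hypothetical.

```python
import string

# Assumption: the allowed set is printable ASCII plus ASCII whitespace,
# approximated here by Python's string.printable.
ALLOWED = frozenset(string.printable)


def find_disallowed(text):
    """Return (offset, character) pairs for every character outside ALLOWED."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in ALLOWED]


if __name__ == "__main__":
    sample = "Agence France Presse reported the story.\n"
    # An empty list means the sample passes the check.
    print(find_disallowed(sample))
```

This checks only the character set. It says nothing about whether the markup is valid SGML, which the announcement says was verified separately with nsgmls and the supplied DTD.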
For further information, including a link to online documentation,
please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for $2,500.
If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call (215) 573-1275.
---------------------------------------------------------------------
Linguistic Data Consortium       Phone: (215) 573-1275
3600 Market Street               Fax:   (215) 573-2175
Suite 810                        email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2653      www:   http://www.ldc.upenn.edu
---------------------------------------------------------------------------
LINGUIST List: Vol-14-321