14.321, FYI: LSA Bulletin, New LDC Corpus

LINGUIST List linguist at linguistlist.org
Fri Jan 31 19:53:46 UTC 2003


LINGUIST List:  Vol-14-321. Fri Jan 31 2003. ISSN: 1068-4875.

Subject: 14.321, FYI: LSA Bulletin, New LDC Corpus

Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: James Yuells <james at linguistlist.org>

=================================Directory=================================

1)
Date:  Wed, 29 Jan 2003 11:14:48 -0500
From:  LSA <lsa at lsadc.org>
Subject:  LSA Bulletin

2)
Date:  Thu, 30 Jan 2003 17:00:21 -0500
From:  LDC Office <ldc at ldc.upenn.edu>
Subject:  New LDC Corpus

-------------------------------- Message 1 -------------------------------

Date:  Wed, 29 Jan 2003 11:14:48 -0500
From:  LSA <lsa at lsadc.org>
Subject:  LSA Bulletin

The December 2002 issue of the LSA Bulletin is now available at the
Linguistic Society of America website: http://www.lsadc.org.


-------------------------------- Message 2 -------------------------------

Date:  Thu, 30 Jan 2003 17:00:21 -0500
From:  LDC Office <ldc at ldc.upenn.edu>
Subject:  New LDC Corpus


 		  *   English Gigaword   *


The Linguistic Data Consortium (LDC) is pleased to announce the
availability of the English Gigaword corpus.

English Gigaword is a comprehensive archive of newswire text data
in English that has been acquired over several years by the LDC. The
newswire texts are drawn from four international sources:

Agence France Presse English Service
Associated Press Worldstream English Service
The New York Times Newswire Service
The Xinhua News Agency English Service

English Gigaword is the first LDC publication to be distributed on
DVD.  Much of the content in this collection has been published
previously by the LDC in a variety of other, older corpora,
particularly the North American News text corpora (LDC95T21, LDC98T30),
the various TDT corpora and the AQUAINT text corpus (LDC2002T31). In
addition to this previously published data, the English Gigaword corpus
contains a significant amount of previously unreleased data,
specifically, all of the Agence France Presse content, the 1995 and
2001 Xinhua content, and portions of NYT and APW dating from February
2001 forward.

All text data are presented in SGML form, using a very simple, minimal
markup structure; all text consists of printable ASCII and whitespace.
The text formatting is consistent across all sources.  The English
Gigaword corpus has been fully validated by a standard SGML parser
utility (nsgmls), using a DTD file which is provided as part of this
publication.
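
As an illustration of the character constraint described above, the
following is a minimal sketch (not part of the LDC distribution) of how
a reader might check that a piece of text contains only printable ASCII
and whitespace.  The function name first_disallowed is hypothetical, and
the exact set of whitespace characters (tab, newline, carriage return)
is an assumption; the announcement does not enumerate them.

```python
# Assumption: "printable ASCII and whitespace" is taken to mean the code
# points 0x20-0x7E plus tab, newline, and carriage return. The original
# announcement does not spell out the exact whitespace set.
ALLOWED = frozenset(chr(c) for c in range(0x20, 0x7F)) | frozenset("\t\n\r")


def first_disallowed(text):
    """Return (index, character) of the first character outside ALLOWED,
    or None if every character is allowed."""
    for i, ch in enumerate(text):
        if ch not in ALLOWED:
            return i, ch
    return None


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the corpus.
    print(first_disallowed("Sample newswire line.\n"))
    print(first_disallowed("caf\u00e9"))
```

A check of this kind only covers the character inventory; it is not a
substitute for validating the markup against the supplied DTD.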

For further information, including a link to online documentation,
please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for $2,500.

			   *

If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call (215) 573-1275.


---------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax:   (215) 573-2175
Suite 810                           email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2653         www: http://www.ldc.upenn.edu

---------------------------------------------------------------------------
LINGUIST List: Vol-14-321


