Corpora: New Corpus

LDC Office ldc at
Wed Mar 22 22:40:30 UTC 2000

BLLIP 1987-89 WSJ Corpus Release 1

LDC is pleased to announce the availability of a new
corpus from the Brown Laboratory for Linguistic
Information Processing (BLLIP):

  The 1987-89 Wall Street Journal (WSJ) Corpus Release 1.

This two-CD-ROM corpus contains a complete,
Treebank-style parsing of the three-year WSJ archive
from the ACL/DCI corpus -- about 30 million words of
text.  The parsing and part-of-speech (POS) tagging
for the entire archive were done using
statistically based methods developed by Eugene
Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale
and Mark Johnson of BLLIP.

This corpus both overlaps and supplements the
1-million-word Penn Treebank collection of parsed and
POS-tagged WSJ texts.

Institutions that have membership in the LDC during
the 2000 Membership Year will be able to receive this
corpus free of charge.  Nonmembers may purchase the
BLLIP 1987-89 WSJ Corpus Release 1 for $100.  All
organizations that wish to receive this corpus must sign
the BLLIP 1987-89 WSJ Corpus Release 1 license agreement,
which can be retrieved from:

If you would like to order a copy of this corpus,
please email your request to <ldc at>. If
you need additional information before placing your
order, or would like to inquire about membership in
the LDC, please send email or call (215) 898-0464.
