Announcing the second edition of the Penn-Helsinki Parsed Corpus of Middle English
Tony Kroch
kroch at change.ling.upenn.edu
Thu Sep 14 01:33:29 UTC 2000
----------------------------Original message----------------------------
The second edition of the Penn-Helsinki Parsed Corpus of Middle English
(PPCME2) is now publicly available under the conditions outlined
below. It consists of 55 text samples containing 1.3 million words of
syntactically annotated Middle English prose and ranging over four
time periods, from 1150 to 1500.
Like the first edition of the PPCME, the PPCME2 is based on the
Middle English portion of the Helsinki Corpus of English Texts that
was created at the University of Helsinki under the direction of
Matti Rissanen and Ossi Ihalainen. The text samples in the second
edition have been enlarged, so that the corpus as a whole is nearly
three times the size of the first edition. In addition, the corpus is
now tagged for part of speech and the syntactic annotation system is richer.
For the earliest time period, all texts except one are complete; the
exception is the Ancrene Riwle sample, which contains approximately
50,000 words. For the later time periods, two texts per time period
were expanded to approximately 50,000 words. The remaining texts are
represented by the Helsinki Corpus sample.
The PPCME2 is being distributed on a CD-ROM that includes several files
for each text in the corpus:
- a file with unannotated text
- a file with philological and other information about the text
(manuscript and edition used, date, dialect, genre, and word count
of the sample)
- a file in which individual words are tagged for part of speech
- a file that is annotated for syntactic structure
Available with the corpus is CorpusSearch, a Java program written by
Beth Randall that runs under Unix, Linux, MacOS and Windows.
CorpusSearch uses standard syntactic predicates like ``(immediately)
precedes'', ``(immediately) dominates'', and Boolean combinations
thereof, and it allows the output of previous searches to be used as
input to further searches.
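
For readers unfamiliar with dominance predicates, the following is a
minimal Python sketch of what "(immediately) dominates" and a Boolean
combination of the two can mean on a bracketed parse tree. It is not
CorpusSearch and does not use its query syntax; the function names, the
node labels, and the example tree are all invented for illustration and
are not taken from the PPCME2.

```python
# Hypothetical illustration only: this is not CorpusSearch, and the tree
# below is an invented example, not a token from the PPCME2.

def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples.
    Leaf words are kept as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield every labelled node, each parent before its children."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def immediately_dominates(tree, parent, child):
    """True if some node labelled `parent` has a daughter labelled `child`."""
    return any(
        n[0] == parent
        and any(isinstance(c, tuple) and c[0] == child for c in n[1])
        for n in subtrees(tree)
    )


def dominates(tree, ancestor, descendant):
    """True if some node labelled `ancestor` contains, at any depth,
    a node labelled `descendant`."""
    for n in subtrees(tree):
        if n[0] == ancestor:
            below = list(subtrees(n))[1:]   # drop the node itself
            if any(m[0] == descendant for m in below):
                return True
    return False


tree = parse("(IP (NP (PRO she)) (VP (VBD sang) (NP (N songs))))")

# Boolean combination: IP dominates N, but not immediately.
deep_only = dominates(tree, "IP", "N") and not immediately_dominates(tree, "IP", "N")
```

Here deep_only is true because the N node sits inside VP and NP rather
than directly under IP. A precedence predicate could be sketched in the
same way by comparing the positions of the words each node covers.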
To order the PPCME2, please go to http://www.ling.upenn.edu/mideng and
follow the instructions there.
The cost of a subscription to the corpus is $200 and the cost of a
license for CorpusSearch is $50. The items may be purchased together
or separately. Proceeds from the sale of the corpus will pay for
improving the corpus and for increasing its size over time. Proceeds
from the sale of CorpusSearch will go to the author.
The PPCME2 was designed and built by Anthony Kroch and Ann Taylor at the
University of Pennsylvania. Supplementary assistance was provided by
Beatrice Santorini. The PPCME2 is part of a larger project to produce a
parsed diachronic corpus of English from 800 to 1800. The Old English part
is under construction at York under the direction of Anthony Warner, Susan
Pintzuk, and Ann Taylor, and the Early Modern English part is under
construction at the University of Pennsylvania under the direction of Kroch
and Santorini.