Corpora: Penn-Helsinki Parsed Corpus of Middle English

Anthony Kroch kroch at linc.cis.upenn.edu
Fri May 19 14:42:08 UTC 2000


This might be of interest to others besides the questioner.

------- Forwarded Message

Date:    Fri, 19 May 2000 10:26:45 -0400
From:    kroch at change.ling.upenn.edu
To:      G.Rundblad at uea.ac.uk
Subject: Corpora: XML programmes and tagging


Hello Gabriella Rundblad,

My name is Anthony Kroch and I am on the faculty of the Linguistics Department
at the University of Pennsylvania. As it happens, my colleagues and I have
already created a corpus of the sort that you are talking about. It is a
parsed version of the prose text samples in the Helsinki Corpus of Historical
English and is called the "Penn-Helsinki Parsed Corpus of Middle English." The
first edition, which is five years old or so, has a total of 500,000 words of
running text and marks clause and phrase structure without indicating part of
speech. A second edition will be released at the end of the month. This new
edition contains 1.3 million words and was created by increasing the size of
the Helsinki samples (to a maximum of 50,000 words when the text was long
enough). The second edition also has a richer annotation system and includes
part-of-speech tagging. The first edition comes with Perl scripts to
facilitate searching and the second edition comes with a specially written
Java program for this purpose.

You can get more information about the corpora from the PPCME web site:
http://www.ling.upenn.edu/mideng

The corpora are not currently in XML format but part of our plan for the
future is to perform that conversion, which can be done automatically for the
most part. We are currently creating a corpus of early Modern English, using
the same annotation guidelines as those of the PPCME2. At the University of
York in England, Prof. Anthony Warner is directing a project to create a
corpus of Old English along the same lines.
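
As a rough illustration of what such an automatic conversion could look like,
here is a minimal Python sketch. It assumes the annotation is stored as
Penn Treebank-style labeled bracketing, e.g. (NP-SBJ (PRO he)). The element
names "node" and "w", the attribute names "cat" and "pos", and the sample
sentence are hypothetical choices made for this example only; they are not
the corpus's actual XML design, and real files would need extra handling for
IDs, traces, and other special tokens.

```python
import re
import xml.etree.ElementTree as ET

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def bracketed_to_xml(text):
    """Convert one labeled-bracket tree into an ElementTree element.

    Phrases become <node cat="LABEL"> and words become <w pos="TAG">word</w>.
    Labels go into attributes because tags such as "." or "PRO$" would not be
    legal XML element names.
    """
    tokens = TOKEN.findall(text)
    i = 0  # index of the next unread token

    def parse():
        nonlocal i
        if i >= len(tokens) or tokens[i] != "(":
            raise ValueError("expected '('")
        i += 1
        if i + 1 >= len(tokens):
            raise ValueError("unexpected end of input")
        label = tokens[i]
        i += 1
        if tokens[i] == "(":
            # Phrase level: the label is followed by one or more subtrees.
            elem = ET.Element("node", cat=label)
            while i < len(tokens) and tokens[i] == "(":
                elem.append(parse())
        else:
            # Word level: the label is followed by a single word form.
            elem = ET.Element("w", pos=label)
            elem.text = tokens[i]
            i += 1
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected ')'")
        i += 1
        return elem

    root = parse()
    if i != len(tokens):
        raise ValueError("unexpected material after the tree")
    return root


if __name__ == "__main__":
    sample = "(IP-MAT (NP-SBJ (PRO he)) (VBD seyde))"  # invented example
    print(ET.tostring(bracketed_to_xml(sample), encoding="unicode"))
```

Keeping the syntactic labels in attributes rather than element names is only
one possible design; it keeps the output well-formed whatever characters the
labels contain.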

Please feel free to contact me if you have any questions about the corpora.

Yours,

Anthony Kroch
Professor and Chair
Department of Linguistics
University of Pennsylvania
Philadelphia, PA 19104-6305
USA


>From: Gabriella Rundblad <G.Rundblad at uea.ac.uk>
>To: CORPORA at hd.uib.no
>Subject: Corpora: XML programmes and tagging
>
>
>Dear all,
>
>Despite having used language corpora for some years, I've
>never put together my own corpus. Until now.
>
>I'm considering putting together a corpus of Middle English
>using already electronically available text, but tagging it
>to enable searches. I shall be attending the Oxford summer
>seminars on digital resources etc. to learn more, but would
>like to address some of the issues already now and perhaps
>do some tests to see if my idea is plausible at all.
>
>
>1) As far as I understand, it is today recommended to use
>XML for tagging purposes. For this I'll need user-friendly
>programme(s); the question is which. I know there are
>freeware, shareware and commercial products out there,
>though I've never tried (yet) any of them and don't
>know how user-friendly they are. I know HTML and use
>Hotmetal Pro for this (great!) and there is obviously an
>XML equivalent (XMetal). Could you advise what programme(s)
>to use?! Is XMetal good for a never-before-tagger?!
>
>2) The tagging I would like to do (I'm reading up on TEI
>etc) is a tagging of phrases and clauses, not parts of
>speech. What's been done on this earlier? Any lists of tags
>etc?
>
>
>Grateful for all the advice you can offer.
>
>
>Gabriella Rundblad
>
>
>University of East Anglia
>School of Language, Linguistics and Translation Studies
>Norwich NR4 7TJ
>UK
>

------- End of Forwarded Message


