[Corpora-List] A question about placement of notes in linguistically annotated corpora of Early Modern texts
Martin Mueller
martin.mueller at mac.com
Sun Jan 20 16:31:02 UTC 2013
Phil Burns from Northwestern's IT group and I are working on a project to
provide linguistic annotation for some 40,000 texts published between 1473
and 1700 and transcribed by the EEBO-TCP project. Currently, all these
texts are available only to the members of institutions that have
subscribed to them. But in 2015, some 25,000 texts will pass into the
public domain, and over the following five years another 45,000 texts will
follow them. Thus students of Early Modern English can look forward to an
environment that will soon provide them with access anywhere anytime to a
rich set of carefully encoded data from the first 250 years of English
print culture.
A much smaller set of ~2,000 18th-century texts from the ECCO-TCP project
has already been released into the public domain, and we expect to provide
linguistically annotated versions of these texts at some point in the
spring or early summer.
If potential users of these data sets have advice to offer, we would very
much like to hear it, and I would like to seek your advice on a particular
question. First a few remarks about the encoding of these texts. They
were encoded in a modified form of TEI P3 that will be transformed to TEI P5 in
the course of our work. The encoding is light but consistent and allows
you to exclude or focus on words that occur in paragraphs, lines of verse,
epigraphs, notes, lists and tables, speaker labels, opening and
closing phrases of correspondence, and a few others. The linguistic
annotation will be "element-aware" in the sense that different rules,
probability tables, and supporting lexica will be used for stuff that is
likely to be special, such as lines of verse, stage directions, or notes.
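As a rough illustration of what "excluding or focusing on" words by element could look like for a user of these files, here is a minimal Python sketch. It is not MorphAdorner's method; the function name words_with_context is hypothetical, tokenization is plain whitespace splitting, and the sample is assumed to be well-formed XML with upper-case tags as in the TCP source.

```python
import xml.etree.ElementTree as ET

def words_with_context(elem, path=()):
    """Yield (word, path) pairs in document order.

    `path` is the tuple of enclosing element tags, so a caller can keep
    or drop words according to where they occur (hypothetical helper).
    """
    here = path + (elem.tag,)
    for word in (elem.text or "").split():
        yield word, here
    for child in elem:
        yield from words_with_context(child, here)
        # Text following a child element belongs to the parent, not the child.
        for word in (child.tail or "").split():
            yield word, here

sample = ET.fromstring(
    '<P>since that one <NOTE PLACE="marg">a fayned conspiracy</NOTE> hauing bin</P>'
)
pairs = list(words_with_context(sample))
main_words = [w for w, p in pairs if "NOTE" not in p]   # main text only
note_words = [w for w, p in pairs if "NOTE" in p]       # notes only
```

The same filter could be applied to other tags (lines of verse, speaker labels, and so on) by testing for a different element name in the path.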
My particular question has to do with the encoding of notes, stuff put
inside <note> elements. Early modern prose is full of notes. In the print
originals they occur sometimes at the foot of the page, but the great majority
of them are marginal notes (and they often are summaries rather than
notes in a modern sense of the word). In the TCP transcriptions, footnotes
and marginal notes are encoded inline. Footnotes are placed where
their markers occur. Marginal notes are put where they fit best, following
broad rules but leaving discretion to the transcribers. Here is a typical
example from A Defence of the Catholyke Cause (1602):
<P>IT is now more then three yeres, gentle reader, since that one Edward
Squyre,<NOTE PLACE="marg">Edvvard Squyre executed for a fayned conspiracy,
and the author of this treatyse charge therevvith.</NOTE> hauing bin
sometyme prisoner in Spayne, and escaping thence into England, was
condemned and executed for a fayned conspiracy against her Maiestyes
person, wherto my self & some others were charged to be priuy; &
for as much as it seemed to mee that this fraudulent manner of our
aduersaries proceeding against Catholykes, by way of slanders and
diffamations, authorised with shew of publik Iustice,<NOTE
PLACE="marg">The reasons that moued the author to vvryte an Apology in his
ovvne defence.</NOTE> and continued now many yeres, did beginne to redound
not only to the vndeserued disgrace, & discredit of particular men
wrongfully accused, but also to the dishonour of our whole cause, I
thought it co~uenie~t to write an Apology in my defe~ce, & to dedicate
the same to the Lords of her Maiesties priuy counsel, as wel to cleare my
self to their honours of the cryme falsly imputed vnto mee, as also to
discouer vnto them the treacherous dealing of such as abuse her Maiesties
autority and theirs in this behalf, to the spilling of much innocent
blood, with no smalle blemish to her Maiesties gouernment, and the assured
exposition of the whole state, to the wrath of God, if it be not remedied
in tyme.</P>
MorphAdorner, Phil Burns' software, treats such <note> elements as "jump
tags", treats their content separately, and "knows" about the reading
order of the main text. We have two choices for dealing with <note>
elements. We could leave them where they are, or we could gather them in
separate <div> elements, leaving some form of marker at the original
location of their encoding. That procedure would be reversible, and it
could also be separately implemented by anybody manipulating the texts. So
in some ways the question does not matter very much.
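To make the "gather and leave a marker" option and its reversibility concrete, here is a minimal Python sketch using the standard library. It is only an illustration under stated assumptions, not the procedure MorphAdorner uses: the marker element PTR, its TARGET attribute, the generated ID values, and the DIV TYPE="notes" container are all hypothetical choices, and the sketch assumes well-formed XML in which notes are not nested inside other notes and carry no ID of their own.

```python
import xml.etree.ElementTree as ET

def parent_map(root):
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    return {child: parent for parent in root.iter() for child in parent}

def gather_notes(root):
    """Move each <NOTE> into a trailing <DIV TYPE="notes">, leaving a <PTR> marker."""
    notes_div = ET.Element("DIV", {"TYPE": "notes"})   # hypothetical container
    parents = parent_map(root)
    for n, note in enumerate(list(root.iter("NOTE")), start=1):
        parent = parents[note]
        idx = list(parent).index(note)
        marker = ET.Element("PTR", {"TARGET": f"n{n}"})  # hypothetical marker
        marker.tail = note.tail        # text after the note stays in the main text
        parent.remove(note)
        parent.insert(idx, marker)
        note.tail = None
        note.set("ID", f"n{n}")
        notes_div.append(note)
    root.append(notes_div)
    return root

def restore_notes(root):
    """Undo gather_notes: put each note back where its marker sits."""
    notes_div = root.find("DIV[@TYPE='notes']")
    by_id = {note.get("ID"): note for note in notes_div}
    root.remove(notes_div)
    parents = parent_map(root)
    for marker in list(root.iter("PTR")):
        note = by_id[marker.get("TARGET")]
        parent = parents[marker]
        idx = list(parent).index(marker)
        note.tail = marker.tail
        del note.attrib["ID"]          # drop the ID that gather_notes added
        parent.remove(marker)
        parent.insert(idx, note)
    return root

src = ('<TEXT><P>one Edward Squyre,<NOTE PLACE="marg">Edvvard Squyre '
       'executed.</NOTE> hauing bin sometyme prisoner</P></TEXT>')
doc = gather_notes(ET.fromstring(src))
main_text = "".join(doc.find("P").itertext())   # reading order, notes removed
round_trip = ET.tostring(restore_notes(doc), encoding="unicode")
```

With the notes gathered, the text of the paragraph can be read straight through, while the marker records where each note was encoded so that the original placement can be restored.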
But from the OWL perspective (Piotr Banski's lovely term for "ordinary
working linguist"), which choice would provide the better default setting
and be more in keeping with practices elsewhere and the expectations of
scholars who may work with those texts? Notice that this question has
nothing to do with the way in which notes would be displayed in a
browser-based rendering of the texts. It is a question about which choice
would on balance provide an easier or more profitable working environment.
My own view so far has been that there would be some advantages in
grouping notes separately. It would make it a little easier to attend to
notes as a genre in their own right, it would make it a little easier to
process the main text because you wouldn't have to worry about stuff that
interrupts the reading order, and from a philological perspective you
could argue that wherever the notes were placed in the original, they
certainly were not placed in the middle of the text. But I'm not very
confident about my hunches in this regard, and if there is a consensus
"out there" about best practices I would much rather follow that than my
own nose.
I would welcome your advice, online or offline, on this topic as well as
any information about the practices of comparable enterprises elsewhere.
With thanks in advance
Martin Mueller
Professor emeritus
Department of English and Classics
Northwestern University