Dave Wilton
Fri Nov 13 15:55:05 UTC 2009

Sonnet 18 is not a representative sample. It is very small. It is by a
single writer. It is a single work. It is writing of a different era. You
cannot extrapolate from this extremely limited sample (I would not call it a
"corpus," which implies a degree of comprehensiveness that a single sonnet
lacks) to draw conclusions about the language as a whole.

Any small sample will invariably have a larger proportion of words from OE.
Since most of our basic vocabulary (the most common 2,000 words or so) is
from OE, the percentage of words from OE will drop as the sample grows. Only
30% of the words in the sonnet are repeated. As the sample grows, the
proportion of repeated words will also grow. The following figures are taken
from Christopher Cannon's "The Making of Chaucer's English" (Cambridge Univ
Press, 1998). The "Epilogue to the Nun's Priest's Tale" is 120 words long
(about the same as Sonnet 18). 88 of these words, or about 73%, are "unique"
(defined as having a headword in the MED). This is about the same as for the
sonnet. But Chaucer's "Troilus and Criseyde," at 65,625 words total, has only
3,612 unique words, or a mere 5.5%.
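
To make the arithmetic concrete, here is a minimal Python sketch that
recomputes the share of unique words from the two sets of figures quoted
above. The function name unique_share is my own label for illustration; the
counts are simply the ones cited from Cannon.

```python
# Figures as quoted above from Cannon (1998): (total words, unique words).
samples = {
    "Epilogue to the Nun's Priest's Tale": (120, 88),
    "Troilus and Criseyde": (65_625, 3_612),
}


def unique_share(total, unique):
    """Return unique words as a percentage of all words in the sample."""
    return 100 * unique / total


for title, (total, unique) in samples.items():
    print(f"{title}: {unique_share(total, unique):.1f}% unique")
```

The short text comes out at roughly 73% unique words and the long one at
roughly 5.5%, which is the point: the longer the sample, the more of it
consists of repetitions of the common words.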

Unfortunately, Cannon does not give figures for OE v. OF origins (but he
does include a complete glossary of Chaucer's corpus with origins marked, so
you can tally them up if you want to spend the time). I did a similar study
of Thomas Hoccleve's (a 15th century protégé of Chaucer) "Complaint" and
"Dialogue" and found:

                 Total words               Unique words
"Complaint"      3,200    88% Germanic       803    67% Germanic
"Dialogue"       6,394    87% Germanic     1,205    58% Germanic

(I did not distinguish between OE and later borrowings from Dutch and other
languages, but the overwhelming majority of the Germanic words were from OE.
Similarly, the Romance words include ones borrowed directly from Latin, but
most are OF or AF.)

Looking at the words repeated the most in the two poems shows the prevalence
of OE origins for the most common words. The 55 most repeated words in the
poems all have Germanic roots. It wasn't until the 56th most common word that
I found one with an OF root. The next OF root came in at 91 on the list of
most common words. But we can see in the Hoccleve samples that, as the size
of the sample grows and we start diluting the influence of the most common
words, the percentage of words with a Romance origin also grows.
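
For anyone who wants to try this kind of tally, here is a rough Python
sketch of the procedure: rank words by frequency, find the rank of the first
Romance-origin word, and compare the Germanic share counted over all words
with the share counted over unique words. Everything in it is an assumption
for illustration: the ORIGINS table is a hypothetical five-word lexicon, the
toy string is not Hoccleve's text, and a real study would take origins from
the MED or OED and would need to lemmatize the text first.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration only; a real study would look
# each headword up in a dictionary such as the MED or OED.
ORIGINS = {
    "the": "Germanic",
    "of": "Germanic",
    "and": "Germanic",
    "my": "Germanic",
    "grace": "Romance",
}


def first_romance_rank(tokens, origins):
    """Return the 1-based frequency rank of the first Romance word, or None."""
    ranked = Counter(tokens).most_common()
    for rank, (word, _count) in enumerate(ranked, start=1):
        if origins.get(word) == "Romance":
            return rank
    return None


def germanic_share(words, origins):
    """Return the percentage of the given words that are marked Germanic."""
    words = list(words)
    germanic = sum(1 for w in words if origins.get(w) == "Germanic")
    return 100 * germanic / len(words)


# Toy input, not a real text: 12 running words, 5 unique words.
toy = "the of the grace of the and of the grace my the".split()

print(first_romance_rank(toy, ORIGINS))              # rank among unique words
print(round(germanic_share(toy, ORIGINS), 1))        # share of all words
print(round(germanic_share(set(toy), ORIGINS), 1))   # share of unique words
```

Even on this toy input the Germanic share is higher when every occurrence is
counted than when each word is counted once, because the most frequently
repeated words are the Germanic ones.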

This also gets us into the writings of a single writer. Hoccleve is not
known for his aureate or Latinate diction, and we should expect a rather
high percentage of Germanic words in his writing. A different 15th century
writer, Lydgate for instance, should have different results and more Romance
words in his corpus. You need to look at a corpus comprising many writers
before drawing general conclusions. You also need to examine multiple
genres. Poetry is one thing, but the diction will be different for
scientific and technical works, political speeches, friendly letters, etc.

You also need to examine multiple works. What was Shakespeare trying to
accomplish with Sonnet 18? His diction here is very particular, very simple,
and as a result very Germanic. He is writing to achieve a particular effect,
and this affects his diction. Is he, for example, rebelling against the
prevalence of "inkhorn" terms in Elizabethan poetry and demonstrating
virtuosity with simple words? And era makes a difference. Chaucer and
Hoccleve, whom I chose because the numbers were at hand, are not good
examples of how we write and speak today, nor is Elizabethan English. If
you want to draw conclusions about the language as she is spoken today, you
need a large and broad corpus of 21st century works.

In the wake of the recent thread on this list to do with the percentage of
the English vocabulary which is "native" (so to speak) and that which is
borrowed, I determined to explore this question by examining a
representative corpus of English.  The results of this study were somewhat
startling, to say the least.

It emerged that 80% of English words are native in origin, with the
remaining 20% being taken entirely from Old French.  Furthermore, the Old
French borrowings are found in a narrowly restricted period of one hundred
years [N1].  Words borrowed from Latin and Greek were conspicuously absent
[N2].

The corpus of English chosen for study encompasses the entirety of
Shakespeare's Sonnet 18, and identification of the origins of the words
found there was performed by reference to The Oxford English Dictionary,
online edition [N3].

 The Corpus of Lexical Items:

Shall I compare thee to a Summers day?
Thou art more louely and more temperate:
Rough windes do shake the darling buds of Maie,
And Sommers lease hath all too short a date:
Sometime too hot the eye of heauen shines,
And often is his gold complexion dimm'd,
And euery faire from faire some-time declines,
By chance, or natures changing course vntrim'd:
But thy eternall Sommer shall not fade,
Nor loose possession of that faire thou ow'st,
Nor shall death brag thou wandr'st in his shade,
When in eternall lines to time thou grow'st,
So long as men can breath or eyes can see,
So long liues this, and this giues life to thee.

SHAKE-SPEARES SONNETS published by Thomas Thorpe in 1609.
© 1995, 1998, 2004 Hardy M. Cook and Ian Lancashire

The actual percentage figures, rounded up or down to whole numbers, are:


         Total        Unique [$1]    Important [$2]

         114          81             60

OE       100 / 88%    68 / 84%       47 / 78% [N4]
OF        14 / 12%    13 / 16%       13 / 22%

[$1] - eliminating repetitions and plural forms of the same word.

[$2] - eliminating conjunctions, prepositions and pronouns to leave nouns,
verbs, adjectives, etc.

(In order to clarify these figures, words noted by the OED as either Old or
Middle English in origin, without a foreign source, are subsumed under the
general rubric, OE.  Similarly, the Old French, Anglo-French, and French of
the OED all huddle together here as OF.)
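
The rounded percentages can be rechecked from the raw counts with a few
lines of Python. This is only a sketch of the arithmetic: the counts are
those given in the table, and the names counts and shares are labels chosen
here for illustration.

```python
# Raw counts from the table above, by column and origin.
counts = {
    "Total": {"OE": 100, "OF": 14},
    "Unique": {"OE": 68, "OF": 13},
    "Important": {"OE": 47, "OF": 13},
}


def shares(column):
    """Return each origin's share of the column, rounded to a whole percent."""
    total = sum(column.values())
    return {origin: round(100 * n / total) for origin, n in column.items()}


for name, column in counts.items():
    print(name, sum(column.values()), shares(column))
```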

I leave it to others to determine why the "myth" of English borrowing from
Latin, and indeed any language other than Old French, has persisted
unchallenged for so long.  It is sufficient for me to have done my small
part in dispelling the fug of disinformation which has persistently
beclouded the study of this aspect of English lexicography.

        Caveats and Further Considerations

To be representative of the full historical range of English, this study
should of course be extended to take in a wider range of corpora from
periods other than the early seventeenth century.  It is for this reason
that the author of this paper will be actively seeking financial support in
order to extend his conclusions to encompass the sixteenth century (Thomas
Wyatt, "Farewell love, and all thy laws forever" [*N1]), the later
seventeenth century (Milton on his late departed saint), the nineteenth
century (Wordsworth on Westminster Bridge), and the twentieth century
(Rupert Brooke, "The Soldier" - 'If I should die, think only this of me'
[*N2]).

The eighteenth century is unfortunately barren of suitable texts, no sonnets
having been committed to writing throughout this period, and thus must be
considered a _locus incognitus lexicalium_.  This apparent deficiency in the
scope of the study will be countered by a consideration of the first of
Elizabeth Barrett Browning's _Sonnets from the Portuguese_ in order to
determine whether there are any gender-specific aspects to English lexical
borrowing, and Edwin Morgan's _50 Renaissance Sonnets_ [translated], _Ten
Glasgow Sonnets_, and the complete sequence of _Sonnets From Scotland_,
these texts providing an insight into the lexical underpinnings of the
author's native land.

A wider exploration of the nature of English lexical borrowings would entail
reference to the recently published _Historical Thesaurus of English_ (OUP)
in order to establish whether the borrowed terms either (a) replaced
already-present native English terms or (b) extended the semantic scope of
the language.  As this resource is not available online, the author was
unable to consult it, and leaves such a consideration to future generations
of scholars who, unlike him, will be in receipt of proper financial support
for their lexicographical endeavours.


[N1]  To be precise, the borrowings fall entirely within the chronological
range 1275-1386.  As this period begins roughly 200 years after the
Anglo-Norman assumption of power in England, and ends 223 years before the
publication of Shakespeare's Sonnets, it would seem likely that these
borrowings are the result of a long and sustained campaign by the English
government to transform the language into a condition suitable for the
publication of Parliamentary Statutes in English rather than Anglo-Norman,
an event which occurred in the course of the reign of Henry VII.  While
proffered merely as a hypothesis, such a conclusion would seem to be
consonant with the material here presented.

[N2]  This absence should be qualified by the observation that the majority
of the Old French words found in the lexical corpus examined are themselves
derived from Latin.  However, as the preponderance of Old English words in
the corpus are similarly derived from Common Germanic, or have Germanic
cognates, it seems legitimate to disregard this aspect of Secondary Latin
Borrowing.  A proper consideration of this aspect of the materials could be
conducted by means of a recourse to Pokorny's _Indogermanisches
etymologisches Woerterbuch_.

[N3]  It might be argued that a consultation of the Ann Arbor Dictionary of
Medieval English and the Dictionary of the Older Scottish Tongue would have
been productive.  This, due to constraints of time and failing eyesight, the
author was unable to undertake.

[N4]  In the case of one of these OE words, "of", there is a degree of
ambiguity in that, while the word itself is native, the sense in which it is
used in the corpus derives from OF.  Regardless of the way in which we
choose to consider the term "of", however, it does not markedly skew the
overall percentage figures.

             Notes to Caveats and Further Considerations:

*N1  In treating this sonnet by Wyatt, the text in the Egerton manuscript
will of course be collated with that found in the Devonshire MS and
_Tottel's Miscellany_.  Reference to the Arundel MS would seem to be
nugatory, since this derives directly from Egerton, and, as is well known,
the Blage MS is silent with regard to this particular text.

*N2  It might be argued that the choice of that sonnet is vitiated by
Brooke's clear dependence on an earlier text by Thomas Hardy, "Drummer
Hodge".  Please to note that the author of this study has taken this issue
into account, but has chosen, for reasons sufficient unto himself, to
disregard it.


(M.A. [Glasgow]; D.Phil. [York]),
Professor Emeritus
University of 'Pataphysics, Cockaigne.

