Corpora: Listing of historical corpora (besides ICAME and Spanish)

Mark Davies mdavies at
Wed May 24 18:57:50 UTC 2000

I'm trying to compile a listing of HISTORICAL corpora for languages other
than Spanish (I already have that list). I am also leaving out the corpora
on the ICAME CD-ROM, which includes The Helsinki Corpus of English Texts,
The Helsinki Corpus of Older Scots, the Corpus of Early English
Correspondence, The Newdigate Newsletters, the Lampeter Corpus, and the
Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET) [see]

Here's a listing of what I have so far:

Language / Name / URL / Approx. time period / Approx. size

1) English / Penn-Helsinki Parsed Corpus of Middle English / 1150-1500 / 1,200,000 words [based on the 
Helsinki corpus]

2) English / Penn-Helsinki Parsed Corpus of Old English / Info at / 850-1150 / 420,000 words
[based on the Helsinki corpus]

3) French / ARTFL (Trésor de la langue française) / / 1600 > / 
115,000,000 words

4) Swedish / Projektet Källtext / ???? / 
2,000,000 words

5) German / Projekt Gutenberg / / 
Mostly 1900s, but a few earlier / 300 texts (# words ??)

6) Portuguese / Tycho Brahe Parsed Corpus of Historical Portuguese / / c1600-1900 / Goal of 1,000,000 words

7) Chinese / Historical Corpora for Synchronic and Diachronic Linguistics 
Studies / / Pre-Qin to 
Chang dynasties (time period??) / 17,000,000 characters

As can be seen, I haven't yet identified many HISTORICAL corpora for 
German, Dutch, Norwegian, Icelandic, Italian, Romanian, Hungarian, Finnish, 
any of the Slavic languages, or any of the other European languages.  In
addition, the only non-European language for which I can find anything is
Chinese. (Also, I know that there must be good collections of classical
Greek and Latin in electronic form and on the Web [due to the large number
of classical texts], but I haven't compiled a list of these yet.)

At any rate, if anyone has information on other historical corpora for the
desired languages, I'd appreciate your sending me a URL for the
resources.  I will be creating a webpage with links to the historical
corpora and will announce it on CORPORA in about a week, once I've
received feedback from others.

Thanks in advance for your help.

Mark Davies

Mark Davies, Associate Professor, Spanish Linguistics
Dept. of Foreign Languages, Illinois State University
Normal, IL 61790-4300

Voice:309/438-7975      email:mdavies at