23.2276, FYI: GerManC Corpus is Now Available

linguist at linguistlist.org
Fri May 11 16:10:57 UTC 2012


LINGUIST List: Vol-23-2276. Fri May 11 2012. ISSN: 1069 - 4875.

Subject: 23.2276, FYI: GerManC Corpus is Now Available

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
Monica Macaulay, U of Wisconsin-Madison
Rajiv Rao, U of Wisconsin-Madison
Joseph Salmons, U of Wisconsin-Madison
Anja Wanner, U of Wisconsin-Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

The LINGUIST List is a non-profit organization dedicated to providing the
discipline of linguistics with the infrastructure necessary to function in
the digital world. Donate to keep our services freely available!
https://linguistlist.org/donation/donate/donate1.cfm

Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================  

Visit LL's Multitree project for over 1000 trees dynamically generated
from scholarly hypotheses about language relationships:
          http://multitree.linguistlist.org/
					
					
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.


Date: Fri, 11 May 2012 12:10:51
From: Richard Whitt [jasonwhitt at mindspring.com]
Subject: GerManC Corpus is Now Available

The complete GerManC Corpus, a representative corpus of Early 
Modern German from 1650 to 1800, is now publicly available at the 
Oxford Text Archive:
http://www.ota.ox.ac.uk/desc/2544

Following the model of the ARCHER corpus and given the aim of 
representativeness, the GerManC corpus consists of text samples of 
about 2000 words from eight genres: drama, newspapers, sermons 
and personal letters (to represent orally oriented registers) and 
narrative prose (fiction or non-fiction), scholarly (i.e. humanities), 
scientific and legal texts (to represent more print-oriented registers). In 
order to facilitate tracing historical developments, the whole period was 
divided into fifty-year sections (in this case 1650-1700, 1700-1750 and 
1750-1800), and an equal number of texts from each genre was 
selected for each of these sub-periods.

The complete corpus thus consists of 360 samples, comprising 
approximately 800,000 words. Appendix 1 in the download package 
contains a list of the files in the corpus with full documentation in an 
Excel spreadsheet. 
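
The figures above imply a simple sampling grid, which can be checked with a 
short Python sketch. The constants are taken from this announcement; the 
variable and function names are illustrative only and are not part of the 
corpus distribution. The announcement does not say how the texts within each 
genre and sub-period are further subdivided, so the sketch stops at that level.

```python
# Sampling grid as described in the announcement (names are illustrative).
GENRES = [
    "drama", "newspapers", "sermons", "personal letters",  # orally oriented
    "narrative prose", "scholarly", "scientific", "legal",  # print-oriented
]
PERIODS = ["1650-1700", "1700-1750", "1750-1800"]
TOTAL_SAMPLES = 360
WORDS_PER_SAMPLE = 2000  # nominal size; the announcement says "about 2000 words"


def samples_per_cell(total, n_genres, n_periods):
    """Return the number of samples per genre/sub-period cell.

    Raises ValueError if the total cannot be split equally, since the
    design calls for an equal number of texts per genre and sub-period.
    """
    cells = n_genres * n_periods
    if total % cells != 0:
        raise ValueError("total does not divide equally across cells")
    return total // cells


per_cell = samples_per_cell(TOTAL_SAMPLES, len(GENRES), len(PERIODS))
nominal_words = TOTAL_SAMPLES * WORDS_PER_SAMPLE

print(per_cell)       # samples per genre and sub-period
print(nominal_words)  # nominal word count at exactly 2000 words per sample
```

Under these assumptions each genre contributes 15 samples per sub-period, and 
the nominal total at exactly 2000 words per sample is 720,000 words. The 
stated total of approximately 800,000 words is higher, presumably because 
individual samples run somewhat over the nominal length.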

Project Team: Martin Durrell (PI), Paul Bennett (Co-Investigator), Silke 
Scheible (RA), Richard J. Whitt (RA), and Astrid Ensslin (RA, 
Newspaper Corpus). 



Linguistic Field(s): Computational Linguistics
                     Historical Linguistics
                     Text/Corpus Linguistics