23.2362, FYI: The GerManC Corpus is Now Available

linguist at linguistlist.org
Thu May 17 14:38:31 UTC 2012


LINGUIST List: Vol-23-2362. Thu May 17 2012. ISSN: 1069 - 4875.

Subject: 23.2362, FYI: The GerManC Corpus is Now Available

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
Monica Macaulay, U of Wisconsin-Madison
Rajiv Rao, U of Wisconsin-Madison
Joseph Salmons, U of Wisconsin-Madison
Anja Wanner, U of Wisconsin-Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

The LINGUIST List is a non-profit organization dedicated to providing the
discipline of linguistics with the infrastructure necessary to function in
the digital world. Donate to keep our services freely available!
https://linguistlist.org/donation/donate/donate1.cfm

Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================  

Visit LL's Multitree project for over 1000 trees dynamically generated
from scholarly hypotheses about language relationships:
          http://multitree.linguistlist.org/
					
					
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.


Date: Thu, 17 May 2012 10:38:28
From: Richard J. Whitt [jasonwhitt at mindspring.com]
Subject: The GerManC Corpus is Now Available

The GerManC Corpus, a multi-genre representative corpus of Early 
Modern German from 1650-1800, is now publicly available for 
download at

http://www.ota.ox.ac.uk/desc/2544.

Following the model of the ARCHER corpus and given the aim of 
representativeness, the GerManC corpus consists of text samples of 
about 2000 words from eight genres: drama, newspapers, sermons 
and personal letters (to represent orally oriented registers) and 
narrative prose (fiction or non-fiction), scholarly (i.e. humanities), 
scientific and legal texts (to represent more print-oriented registers). In 
order to facilitate tracing historical developments, the whole period was 
divided into fifty-year sections (in this case 1650-1700, 1700-1750 and 
1750-1800), and an equal number of texts from each genre was 
selected for each of these sub-periods.

The complete corpus thus consists of 360 samples, comprising 
approximately 800,000 words. Appendix 1 in the download package 
contains a list of the files in the corpus with full documentation in an 
Excel spreadsheet. In addition to plain text, the corpus is also available 
in TEI Lite P5 XML, GATE XML, and GATE column formats.
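
As a rough check on the design described above, the figures can be worked through in a few lines of Python. This is only a sketch: the variable names are illustrative, and the even split across genre/period cells is inferred from the statement that an equal number of texts from each genre was selected for each sub-period.

```python
# Figures taken from the announcement; names are illustrative only.
GENRES = [
    "drama", "newspapers", "sermons", "personal letters",
    "narrative prose", "scholarly", "scientific", "legal",
]
PERIODS = ["1650-1700", "1700-1750", "1750-1800"]
TOTAL_SAMPLES = 360
WORDS_PER_SAMPLE = 2000  # "about 2000 words", so only a nominal value

# One cell per genre/sub-period combination.
cells = len(GENRES) * len(PERIODS)

# Assumption: samples are spread evenly over the cells.
samples_per_cell, remainder = divmod(TOTAL_SAMPLES, cells)

# Nominal corpus size if every sample were exactly 2000 words.
nominal_words = TOTAL_SAMPLES * WORDS_PER_SAMPLE

print(cells, samples_per_cell, remainder, nominal_words)
```

Under these assumptions each genre contributes 15 samples per sub-period. The nominal word count is only an estimate, since sample length is approximate; the announcement itself gives the total as approximately 800,000 words.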

Project website:

http://www.llc.manchester.ac.uk/research/projects/germanc/

Project Team: Martin Durrell (PI), Paul Bennett (Co-Investigator), Silke 
Scheible (RA), Richard J. Whitt (RA), and Astrid Ensslin (RA, 
newspaper corpus) 



Linguistic Field(s): Computational Linguistics
                     Historical Linguistics
                     Text/Corpus Linguistics





 






----------------------------------------------------------
LINGUIST List: Vol-23-2362	
----------------------------------------------------------