23.2362, FYI: The GerManC Corpus is Now Available
linguist at linguistlist.org
Thu May 17 14:38:31 UTC 2012
LINGUIST List: Vol-23-2362. Thu May 17 2012. ISSN: 1069 - 4875.
Subject: 23.2362, FYI: The GerManC Corpus is Now Available
Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews: Veronika Drake, U of Wisconsin-Madison
Monica Macaulay, U of Wisconsin-Madison
Rajiv Rao, U of Wisconsin-Madison
Joseph Salmons, U of Wisconsin-Madison
Anja Wanner, U of Wisconsin-Madison
<reviews at linguistlist.org>
Homepage: http://linguistlist.org
The LINGUIST List is a non-profit organization dedicated to providing the
discipline of linguistics with the infrastructure necessary to function in
the digital world. Donate to keep our services freely available!
https://linguistlist.org/donation/donate/donate1.cfm
Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================
Visit LL's Multitree project for over 1000 trees dynamically generated
from scholarly hypotheses about language relationships:
http://multitree.linguistlist.org/
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.
Date: Thu, 17 May 2012 10:38:28
From: Richard J. Whitt [jasonwhitt at mindspring.com]
Subject: The GerManC Corpus is Now Available
The GerManC Corpus, a multi-genre representative corpus of Early
Modern German from 1650-1800, is now publicly available for
download at
http://www.ota.ox.ac.uk/desc/2544.
Following the model of the ARCHER corpus and given the aim of
representativeness, the GerManC corpus consists of text samples of
about 2000 words from eight genres: drama, newspapers, sermons
and personal letters (to represent orally oriented registers) and
narrative prose (fiction or non-fiction), scholarly (i.e. humanities),
scientific and legal texts (to represent more print-oriented registers). In
order to facilitate tracing historical developments, the whole period was
divided into fifty-year sections (in this case 1650-1700, 1700-1750 and
1750-1800), and an equal number of texts from each genre was
selected for each of these sub-periods.
The complete corpus thus consists of 360 samples, comprising
approximately 800,000 words. Appendix 1 in the download package
contains a list of the files in the corpus with full documentation in an
Excel spreadsheet. In addition to plain text, the corpus is also available
in TEI Lite P5 XML, GATE XML, and GATE column formats.
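
As a rough cross-check of the design described above, the following Python
sketch enumerates the genre-by-period grid and derives the figures it
implies. The variable names are illustrative only, and the flat 2000 words
per sample is an assumption taken from the "about 2000 words" stated above,
not an exact figure from the corpus documentation.

```python
from itertools import product

# Genres and sub-periods as listed in the announcement.
GENRES = [
    "drama", "newspapers", "sermons", "personal letters",
    "narrative prose", "scholarly", "scientific", "legal",
]
PERIODS = ["1650-1700", "1700-1750", "1750-1800"]

TOTAL_SAMPLES = 360       # stated size of the complete corpus
WORDS_PER_SAMPLE = 2000   # assumption: "about 2000 words" treated as exact

# Every genre is sampled equally in every sub-period, so the design
# is a full grid of (genre, period) cells.
cells = list(product(GENRES, PERIODS))
samples_per_cell, remainder = divmod(TOTAL_SAMPLES, len(cells))
estimated_words = TOTAL_SAMPLES * WORDS_PER_SAMPLE

print(f"{len(cells)} genre/period cells")
print(f"{samples_per_cell} samples per cell (remainder {remainder})")
print(f"about {estimated_words} words at {WORDS_PER_SAMPLE} per sample")
```

The grid divides evenly into 360 samples, which is consistent with an equal
number of texts per genre and sub-period. The flat-rate word estimate comes
out below the stated total of approximately 800,000 words, so the samples
presumably run somewhat longer than 2000 words on average.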
Project web-site:
http://www.llc.manchester.ac.uk/research/projects/germanc/
Project Team: Martin Durrell (PI), Paul Bennett (Co-Investigator), Silke
Scheible (RA), Richard J. Whitt (RA), and Astrid Ensslin (RA,
newspaper corpus)
Linguistic Field(s): Computational Linguistics
Historical Linguistics
Text/Corpus Linguistics