22.144, FYI: Icelandic Parsed Historical Corpus (IcePaHC) V0.3

linguist at LINGUISTLIST.ORG
Sat Jan 8 15:53:40 UTC 2011


LINGUIST List: Vol-22-144. Sat Jan 08 2011. ISSN: 1068-4875.

Subject: 22.144, FYI: Icelandic Parsed Historical Corpus (IcePaHC) V0.3

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Monica Macaulay, U of Wisconsin-Madison
         Eric Raimy, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.

===========================Directory==============================  

1)
Date: 06-Jan-2011
From: Joel Wallenberg [joel.wallenberg at gmail.com]
Subject: Icelandic Parsed Historical Corpus (IcePaHC) V0.3
 

	
-------------------------Message 1 ---------------------------------- 
Date: Sat, 08 Jan 2011 10:51:12
From: Joel Wallenberg [joel.wallenberg at gmail.com]
Subject: Icelandic Parsed Historical Corpus (IcePaHC) V0.3

  


We are pleased to announce that version 0.3 of the Icelandic Parsed
Historical Corpus (IcePaHC) is now available for free download. 

The corpus is syntactically parsed, annotated for full phrase structure
using an adaptation of the annotation scheme used by the Penn parsed
corpora of historical English (http://www.ling.upenn.edu/hist-corpora/) and
other corpora in that tradition (see links from website). The corpus
contains ca. 262,000 words drawn from every century from the 12th through
the 19th. Please note that this is about a quarter of the ultimate goal for
the completed corpus, ca. 1 million words.

The corpus is distributed as raw UTF-8 text in labeled bracketing format,
and it is therefore compatible with various existing programs, including
CorpusSearch (http://corpussearch.sourceforge.net/).
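For readers who have not worked with labeled bracketing before, the following
is a minimal Python sketch of how such data can be read into nested
structures. The sample tree, its tag names, and the helper names
parse_brackets and leaves are made up for illustration and are not taken
from the corpus; real corpus files may carry additional details (for
example tree IDs or lemmas) that this sketch does not handle.

```python
def parse_brackets(text):
    """Parse labeled bracketing into nested (label, children) tuples.

    Each child is either a string (a word) or another (label, children) tuple.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # The label is the first token after '('; a wrapper node may have none.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    trees = []
    while pos < len(tokens):
        trees.append(read_node())
    return trees


def leaves(node):
    """Return the words of a tree in left-to-right order."""
    _label, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical example tree; not an actual sentence from the corpus.
SAMPLE = "(IP-MAT (NP-SBJ (PRO hann)) (VBDI kom) (ADVP (ADV heim)))"

trees = parse_brackets(SAMPLE)
words = leaves(trees[0])
print(trees[0][0], words)
```

For actual syntactic queries, a dedicated tool such as CorpusSearch is
likely the better choice; a reader like this one is mainly useful for quick
word counts or for converting trees to another format.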

The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

Further information on the annotation guidelines and project organization
can be found on the project wiki:
www.linguist.is/icelandic_treebank/

We hope that this release will result in feedback that allows us to improve
the resource for upcoming versions. Updates are released every three months;
the upcoming version 0.4 will be released on April 4th, 2011. Between
releases, development can be tracked in our open repository on GitHub
(http://github.com/antonkarl/icecorpus), but use of released versions is
encouraged to ensure that results can be replicated.

Texts included in Version 0.3:
4439 words from The First Grammatical Treatise (entire text) (12th century)
8179 words from Íslensk hómilíubók (Icelandic book of homilies) (12th century)
3459 words from Egils saga (theta fragment) (13th century)
22720 words from Sturlunga saga (13th century)
23040 words from Finnboga saga ramma (1350)
11486 words from Bandamanna saga (1450)
23041 words from Vilhjálms saga Sjóðs (1450)
8582 words from Erasmus saga (1525)
20683 words from the New Testament's Gospel of John (1540)
16421 words from the New Testament's Acts (1540)
17127 words from Ólafur Egilsson's travelogue (1628)
9760 words from Píslarsaga Jóns Magnússonar (1659)
22905 words from Jón Indíafari's travelogue (1661)
22099 words from Jón Steingrímsson's biography (1791)
3269 words from Jónas Hallgrímsson's essay on the nature and origin of the
earth (1835)
17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
27192 words from Brynjólfur Sveinsson biskup (novel by Torfhildur Hólm) (1882)
Total number of words: 262240
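
As a rough check on how the material is spread over time, the per-text
figures above can be tallied by century. The sketch below is in Python; the
century assigned to each text is read off the date given in the list, and
the name TEXTS is only illustrative. Note that the per-text figures as
listed add up to 262239, one word short of the stated total; the
announcement does not say where the difference comes from.

```python
from collections import defaultdict

# (word count, century) for each text, in the order listed above.
# Centuries are derived from the listed dates (e.g. 1350 -> 14th century).
TEXTS = [
    (4439, 12), (8179, 12),
    (3459, 13), (22720, 13),
    (23040, 14),
    (11486, 15), (23041, 15),
    (8582, 16), (20683, 16), (16421, 16),
    (17127, 17), (9760, 17), (22905, 17),
    (22099, 18),
    (3269, 19), (17837, 19), (27192, 19),
]

GOAL = 1_000_000  # "ca. 1 million words" for the completed corpus

per_century = defaultdict(int)
for words, century in TEXTS:
    per_century[century] += words

total = sum(per_century.values())

for century in sorted(per_century):
    count = per_century[century]
    print(f"{century}th century: {count:>6} words ({count / total:.1%})")

print(f"Sum of listed texts: {total}")            # 262239
print(f"Share of planned corpus: {total / GOAL:.1%}")  # 26.2%
```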


Joel C. Wallenberg (joel.wallenberg at gmail.com)
Anton Karl Ingason (anton.karl.ingason at gmail.com)
Einar Freyr Sigurðsson (einarfs at gmail.com)
Eiríkur Rögnvaldsson (eirikur at hi.is)
University of Iceland

The project is funded by the following grants:

Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language
Technology beyond English - Icelandic as a test case".

U.S. National Science Foundation (NSF) International Research Fellowship
Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a
comparative study of grammatical change in Icelandic and English".



Linguistic Field(s): Text/Corpus Linguistics

Subject Language(s): Icelandic (isl)





 



