[Corpora-List] Available: Icelandic Parsed Historical Corpus (IcePaHC), V0.2

Anton Karl Ingason anton.karl.ingason at gmail.com
Fri Oct 1 18:04:26 UTC 2010


We are pleased to announce that version 0.2 of the Icelandic Parsed
Historical Corpus (IcePaHC) is now available for free download.

The corpus is syntactically parsed, annotated for full phrase structure
using an adaptation of the annotation scheme used by the Penn parsed corpora
of historical English (http://www.ling.upenn.edu/hist-corpora/) and other
corpora in that tradition (see links from website). The corpus contains ca.
120,000 words from six different centuries (12th, 13th, 16th, 17th, 18th and
19th). Please note that this is a small portion of the ultimate goal for the
completed corpus, ca. 1 million words from the 12th-19th centuries.

The corpus is distributed as raw UTF-8 data in labeled bracketing format and
it is therefore compatible with various existing programs, including
CorpusSearch (http://corpussearch.sourceforge.net/).
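
For readers who want to process the files without CorpusSearch, the
following is a minimal Python sketch of reading one tree in labeled
bracketing format. The function names (parse_bracketed, leaves) and the
sample tree with its tags are illustrative assumptions, not part of the
IcePaHC distribution or its annotation guidelines.

```python
import re


def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested (label, children) tuples.

    Hypothetical helper for illustration; terminals are kept as plain strings.
    """
    # Tokens are "(", ")" or any run of characters without whitespace/parens.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # The label may be missing on wrapper nodes such as "( (IP-MAT ...))".
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # terminal (word)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse_node()


def leaves(tree):
    """Yield the terminal strings of a parsed tree, left to right."""
    _label, children = tree
    for child in children:
        if isinstance(child, tuple):
            yield from leaves(child)
        else:
            yield child


# Made-up example tree; the tags only imitate the Penn-style conventions.
sample = "(IP-MAT (NP-SBJ (PRO-N hann)) (VBDI kom))"
tree = parse_bracketed(sample)
print(tree[0], list(leaves(tree)))
```

The sketch handles a single well-formed tree at a time; splitting a corpus
file into individual trees and handling any file-specific metadata is left
out.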

The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

Further information on the annotation guidelines and project organization
can be found on the project wiki:
www.linguist.is/icelandic_treebank/

We hope that this release will result in feedback that allows us to improve
the resource for upcoming versions. Updates are released every three months;
the upcoming 0.3 version will be released on January 1st, 2011. Between
releases, development can be tracked at our open repository on GitHub
(http://github.com/antonkarl/icecorpus), but use of released versions is
encouraged to ensure that results can be replicated.

Texts included in Version 0.2:
4585 words from The First Grammatical Treatise (entire text) (12th century)
8179 words from Íslensk hómilíubók (Icelandic book of homilies) (12th century)
3459 words from Egils saga (theta fragment) (13th century)
22719 words from Sturlunga saga (13th century)
20683 words from the New Testament's Gospel of John (1540)
16421 words from the New Testament's Acts (1540)
4521 words from Jón Indíafari's travelogue (1661)
22097 words from Jón Steingrímsson's biography (1791)
17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
Total number of words: 120355


Joel Wallenberg (joel.wallenberg at gmail.com)
Anton Karl Ingason (anton.karl.ingason at gmail.com)
Einar Freyr Sigurðsson (einarfs at gmail.com)
Eiríkur Rögnvaldsson (eirikur at hi.is)
University of Iceland

The project is funded by the following grants:

Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language
Technology beyond English – Icelandic as a test case".

U.S. National Science Foundation (NSF) International Research Fellowship
Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a
comparative study of grammatical change in Icelandic and English".