[Corpora-List] IcePaHC 0.9. 1 million words of syntactically parsed (hand-corrected) Icelandic

Anton Karl Ingason anton.karl.ingason at gmail.com
Mon Aug 29 13:59:08 UTC 2011


We are very pleased to announce that version 0.9 of the Icelandic Parsed
Historical Corpus (IcePaHC) is now available for free download.

The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

The corpus is a treebank of over 1 million words in size, annotated for full
phrase structure parse, and hand-corrected, using an adaptation of the
annotation scheme used by the Penn Treebank and the Penn parsed corpora of
historical English (http://www.ling.upenn.edu/hist-corpora/). Note that this
release contains all of the text for version 1.0, but some minor corrections
remain to be finished.

The corpus contains:

- 1 002 361 words total, consisting of ~100 000-word samples from each
century from the 12th to the beginning of the 21st century.
- Annotated with a phrase structure parse, part-of-speech-tagged, and
lemmatized.
- The entire parse, POS tagging, and lemmata for every sentence have been
*hand-corrected*.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the corpus for
research and/or profit with appropriate citation.

The corpus is distributed as raw UTF-8 data in labeled bracketing format and
it is therefore compatible with various existing programs, including
CorpusSearch (http://corpussearch.sourceforge.net/).
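
As a rough illustration of why labeled bracketing is easy to process with
your own scripts as well, here is a minimal Python sketch that reads one
bracketed tree into nested lists and lists its (tag, word) pairs. The sample
tree is invented for this example and is not taken from the corpus; the
labels only imitate the Penn-style scheme, and the function names are
hypothetical helpers, not part of IcePaHC or CorpusSearch.

```python
def tokenize(text):
    # Pad parentheses with spaces so split() separates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())      # nested constituent
            else:
                items.append(tokens[pos])  # label or word
                pos += 1
        pos += 1  # consume the closing ')'
        return items

    return node()


def leaves(tree):
    """Return (tag, word) pairs in left-to-right order."""
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        return [(tree[0], tree[1])]
    pairs = []
    for child in tree:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs


# Invented sample tree (assumption: not an actual corpus sentence).
sample = "(IP-MAT (NP-SBJ (PRO hún)) (VBDI las) (NP-OB1 (N bók)))"
tree = parse(tokenize(sample))
print(leaves(tree))
```

For real queries over the whole corpus, an existing tool such as
CorpusSearch is likely the better choice; the sketch only shows that the
format itself needs nothing beyond UTF-8 text handling.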

A plain text version without markup and a set of info files containing
philological information accompany the corpus download.

The entire corpus is available for download in three forms: a plain text
version, a version with a platform-independent GUI, and a version with a
Windows-compatible GUI for ease of searching.

Further information on the annotation guidelines and project organization
can be found on the project wiki:
www.linguist.is/icelandic_treebank/


Joel C. Wallenberg (joel.wallenberg at gmail.com)
Anton Karl Ingason (anton.karl.ingason at gmail.com)
Einar Freyr Sigurðsson (einarfs at gmail.com)
Eiríkur Rögnvaldsson (eirikur at hi.is)
University of Iceland

We were grateful to receive support for this project through the following
grants:

Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language
Technology beyond English – Icelandic as a test case".

U.S. National Science Foundation (NSF) International Research Fellowship
Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a
comparative study of grammatical change in Icelandic and English".

University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant
"Icelandic Diachronic Treebank" (Sögulegur íslenskur trjábanki).