22.3413, FYI: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
linguist at LINGUISTLIST.ORG
Tue Aug 30 14:30:19 UTC 2011
LINGUIST List: Vol-22-3413. Tue Aug 30 2011. ISSN: 1068 - 4875.
Subject: 22.3413, FYI: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews: Veronika Drake, U of Wisconsin-Madison
Monica Macaulay, U of Wisconsin-Madison
Rajiv Rao, U of Wisconsin-Madison
Joseph Salmons, U of Wisconsin-Madison
Anja Wanner, U of Wisconsin-Madison
<reviews at linguistlist.org>
Homepage: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================
Visit LL's Multitree project for over 1000 trees dynamically generated
from scholarly hypotheses about language relationships:
http://multitree.linguistlist.org/
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.
===========================Directory==============================
1)
Date: 29-Aug-2011
From: Joel Wallenberg [joel.wallenberg at gmail.com]
Subject: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
-------------------------Message 1 ----------------------------------
Date: Tue, 30 Aug 2011 10:29:41
From: Joel Wallenberg [joel.wallenberg at gmail.com]
Subject: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
We are very pleased to announce that version 0.9 of the Icelandic
Parsed Historical Corpus (IcePaHC) is now available for free download.
The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download
The corpus is a treebank of over 1 million words, annotated with a
full phrase structure parse and hand-corrected, using an adaptation of
the annotation scheme used by the Penn Treebank and the Penn
parsed corpora of historical English
(http://www.ling.upenn.edu/hist-corpora/). Note that this release
contains all of the text for version 1.0, but some minor corrections
remain to be finished.
The corpus contains:
- 1 002 361 words total, consisting of ~100 000-word samples from
each century from the 12th to the beginning of the 21st century.
- Annotated with a phrase structure parse, part-of-speech-tagged, and
lemmatized.
- The entire parse, pos-tagging, and lemmata for every sentence have
been *hand-corrected*.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the
corpus for research and/or profit with appropriate citation.
The corpus is distributed as raw UTF-8 data in labeled bracketing
format and it is therefore compatible with various existing programs,
including CorpusSearch (http://corpussearch.sourceforge.net/).
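To illustrate what working with labeled bracketing involves, here is a minimal Python sketch that reads one bracketed tree and lists its terminal nodes. It is not part of the IcePaHC distribution or of CorpusSearch. The sample sentence and its tags are invented for illustration and are not taken from the corpus, and the function names are hypothetical. The sketch assumes Penn-style bracketing in which every leaf has the form (TAG token).

```python
from typing import Iterator, List, Tuple


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into parentheses, labels, and tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0):
    """Parse one constituent starting at tokens[pos], which must be "(".

    Returns ((label, children), next_pos). A child is either a nested
    (label, children) pair or a plain string token.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    label = ""
    # Some Penn-style files wrap each tree in an unlabeled outer bracket,
    # so the label is treated as optional here.
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child = tokens[pos]
            pos += 1
        children.append(child)
    return (label, children), pos + 1


def leaves(tree) -> Iterator[Tuple[str, str]]:
    """Yield (tag, token) pairs for the terminal nodes, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield label, child
        else:
            yield from leaves(child)


# Invented example tree (assumption: not an actual corpus sentence).
SAMPLE = "(IP-MAT (NP-SBJ (PRO hann)) (VBDI kom) (. .))"

tree, _ = parse(tokenize(SAMPLE))
tagged = list(leaves(tree))
```

Here `tagged` holds the part-of-speech tag and token for each leaf of the sample. For real queries over the corpus, a dedicated tool such as CorpusSearch is the better choice; the sketch only shows that the format is simple enough to process with a few lines of code.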
A plain text version without markup and a set of info files containing
philological information accompany the corpus download.
The entire corpus may be downloaded as plain text, with a
platform-independent GUI, or with a Windows-compatible GUI for ease of
searching.
Further information on the annotation guidelines and project
organization can be found on the project wiki:
www.linguist.is/icelandic_treebank/
Joel C. Wallenberg (joel.wallenberg at gmail.com)
Anton Karl Ingason (anton.karl.ingason at gmail.com)
Einar Freyr Sigurðsson (einarfs at gmail.com)
Eiríkur Rögnvaldsson (eirikur at hi.is)
University of Iceland
We were grateful to receive support for this project through the
following grants:
Icelandic Research Fund (RANNÍS), grant nr. 090662011, ''Viable
Language Technology beyond English - Icelandic as a test case''.
U.S. National Science Foundation (NSF) International Research
Fellowship Program (IRFP), grant #OISE-0853114, ''Evolution of
Language Systems: a comparative study of grammatical change in
Icelandic and English''.
University of Iceland Research Fund (Rannsóknasjóður Háskóla
Íslands), grant ''Icelandic Diachronic Treebank'' (Sögulegur íslenskur
trjábanki).
Linguistic Field(s): Computational Linguistics
Historical Linguistics
Syntax
Text/Corpus Linguistics
-----------------------------------------------------------
LINGUIST List: Vol-22-3413
----------------------------------------------------------