[Corpora-List] The most free and open parsed corpus possible?

Anton Karl Ingason anton.karl.ingason at gmail.com
Mon Jan 10 17:31:28 UTC 2011


Numerous arguments exist for the benefits of free and open resources. In our
corpus project, the Icelandic Parsed Historical Corpus (IcePaHC), one of our
goals is to identify how we can make the most out of these benefits and
compare our approach to the approaches that others have taken with their
parsed corpora (the same issues will of course in many cases apply
equivalently to other types of resources). Our goal is not to "win the
competition of the most free parsed corpus", but rather to learn what steps
one might take to maximize the benefits of such an approach, while doing our
best to carry out these steps in the context of our project.

Below is a list of steps we decided to pursue to this end.
We would like to ask Corpora List:
- Are there some other concrete steps that we should state explicitly in
order to achieve our goal?
- Do you disagree with some of the steps?
- What is the situation for other parsed corpora with regard to the steps we
list? In particular it would be useful to get a "yes/no/comment" for each
item on the list for a particular corpus and/or a reference to a
paper/website that can be cited for that information.

The steps we have taken with IcePaHC:
1) Raw data can be downloaded for local use (corpus not hidden
behind a search interface)
2) Comprehensive documentation freely available online
3) Available without registration, user identification of some sort, or
signing of contracts
4) Development process of corpus relies only on free/open source software
tools (for transparent replication of annotation process)
5) Open development (annotation is carried out in an open online version
control repository for transparency regarding the actual steps taken in the
development and immediate access to work-in-progress)
6) Regular scheduled releases of numbered versions during development as
well as for more permanent milestone versions so that researchers can always
produce replicable results on a recent version of the corpus
7) Users can improve the corpus and release modified versions without
special permission
8) Free of cost to academia
9) Free of cost to commercial users
10) Corpus released under a standard free license of some sort for
straightforward compatibility with other projects (GPL, LGPL, CC, etc.)

The latest version of our corpus, IcePaHC, preview version 0.3, with 262,000
words is available for download as described in the announcement below.

-----------

Available: Icelandic Parsed Historical Corpus, V0.3

We are pleased to announce that version 0.3 of the Icelandic Parsed
Historical Corpus (IcePaHC) is now available for free download.

The corpus is syntactically parsed, annotated for full phrase structure
using an adaptation of the annotation scheme used by the Penn parsed corpora
of historical English (http://www.ling.upenn.edu/hist-corpora/) and other
corpora in that tradition (see links from website). The corpus contains ca.
262,000 words from every century between the 12th and the 19th centuries
inclusive. Please note that this is about a quarter of the ultimate goal for
the completed corpus, ca. 1 million words.

The corpus is distributed as raw UTF-8 data in labeled bracketing format and
it is therefore compatible with various existing programs, including
CorpusSearch (http://corpussearch.sourceforge.net/).
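To illustrate what "labeled bracketing format" means in practice, here is a
minimal Python sketch of reading one such tree without any corpus-specific
tooling. It is not part of the IcePaHC distribution or of CorpusSearch. The
function names (parse_bracketed, leaves) and the sample tree are made up for
this illustration, and the sample is not actual IcePaHC annotation. The sketch
assumes one complete, well-formed tree per input string.

```python
def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested (label, children) tuples.

    Leaves are plain strings. Assumes the input holds exactly one tree.
    """
    # Pad brackets with spaces so that split() yields them as separate tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def peek():
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets: unexpected end of input")
        return tokens[pos]

    def read_node():
        nonlocal pos
        if peek() != "(":
            raise ValueError("expected '(' but found %r" % peek())
        pos += 1
        # A node may lack a label, e.g. an outer wrapper "( (IP-MAT ...))".
        label = ""
        if peek() not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while peek() != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the first tree")
    return tree


def leaves(node):
    """Return the leaf strings of a parsed tree, left to right."""
    _label, children = node
    result = []
    for child in children:
        if isinstance(child, str):
            result.append(child)
        else:
            result.extend(leaves(child))
    return result


# Hypothetical example tree in Penn-style notation (invented for illustration).
SAMPLE = "(IP-MAT (NP-SBJ (PRO-N hann)) (VBDI kom) (ADVP (ADV heim)))"

if __name__ == "__main__":
    tree = parse_bracketed(SAMPLE)
    print(tree[0])       # label of the root node
    print(leaves(tree))  # the words of the sentence
```

A real word count over the corpus would also need to decide how to treat
punctuation, empty categories and metadata nodes, which this sketch does not
attempt.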

The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

Further information on the annotation guidelines and project organization
can be found on the project wiki:
www.linguist.is/icelandic_treebank/

We hope that this release will result in feedback that allows us to improve
the resource for upcoming versions. Updates are released every three months
- the upcoming 0.4 version will be released on April 4th 2011. Between
releases, development can be tracked at our open repository on GitHub
(http://github.com/antonkarl/icecorpus), but use of released versions is
encouraged to ensure that results can be replicated.

Texts included in Version 0.3:
4439 words from The First Grammatical Treatise (entire text) (12th century)
8179 words from Íslensk hómilíubok (Icelandic book of homilies) (12th
century)
3459 words from Egils saga (theta fragment) (13th century)
22720 words from Sturlunga saga (13th century)
23040 words from Finnboga saga ramma (1350)
11486 words from Bandamanna saga (1450)
23041 words from Vilhjálms saga Sjóðs (1450)
8582 words from Erasmus saga (1525)
20683 words from the New Testament's Gospel of John (1540)
16421 words from the New Testament's Acts (1540)
17127 words from Ólafur Egilsson's travelogue (1628)
9760 words from Píslarsaga Jóns Magnússonar (1659)
22905 words from Jón Indíafari's travelogue (1661)
22099 words from Jón Steingrímsson's biography (1791)
3269 words from Jónas Hallgrímsson's essay on the nature and origin of the
earth (1835)
17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
27192 words from Brynjólfur Sveinsson biskup (novel by Torfhildur Hólm)
(1882)
Total number of words: 262240


Joel C. Wallenberg (joel.wallenberg at gmail.com)
Anton Karl Ingason (anton.karl.ingason at gmail.com)
Einar Freyr Sigurðsson (einarfs at gmail.com)
Eiríkur Rögnvaldsson (eirikur at hi.is)
University of Iceland

The project is funded by the following grants:

Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language
Technology beyond English – Icelandic as a test case".

U.S. National Science Foundation (NSF) International Research Fellowship
Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a
comparative study of grammatical change in Icelandic and English".