[Corpora-List] The most free and open parsed corpus possible?

Karen Fort karen.fort at inist.fr
Tue Jan 11 10:24:36 UTC 2011


Hi,

If it's a corpus, then you should have a look at the LGPL-LR (for 
Linguistic Resource) licence.

Also note that without an explicit mention of the license (for example, 
in a license.txt file), the most restrictive rights apply to the corpus by default.

Hope this helps,

Karen

On 10/01/2011 18:31, Anton Karl Ingason wrote:
> Numerous arguments exist for the benefits of free and open resources. In
> our corpus project, the Icelandic Parsed Historical Corpus (IcePaHC),
> one of our goals is to identify how we can make the most out of these
> benefits and compare our approach to the approaches that others have
> taken with their parsed corpora (the same issues will of course in many
> cases apply equivalently to other types of resources). Our goal is not
> to "win the competition of the most free parsed corpus", but rather to
> learn what steps one might take to maximize the benefits of such an
> approach, while doing our best to carry out these steps in the context
> of our project.
>
> Below is a list of steps we decided to pursue to this end.
> We would like to ask Corpora List:
> - Are there some other concrete steps that we should state explicitly in
> order to achieve our goal?
> - Do you disagree with some of the steps?
> - What is the situation for other parsed corpora with regard to the
> steps we list? In particular it would be useful to get a
> "yes/no/comment" for each item on the list for a particular corpus
> and/or a reference to a paper/website that can be cited for that
> information.
>
> The steps we have taken with IcePaHC:
> 1) Raw data can be downloaded for local use (corpus not hidden
> behind a search interface)
> 2) Comprehensive documentation freely available online
> 3) Available without registration, user identification of some sort, or
> signing of contracts
> 4) Development process of corpus relies only on free/open source
> software tools (for transparent replication of annotation process)
> 5) Open development (annotation is carried out in an open online version
> control repository for transparency regarding the actual steps taken in
> the development and immediate access to work-in-progress)
> 6) Regular scheduled releases of numbered versions during development as
> well as for more permanent milestone versions so that researchers can
> always produce replicable results on a recent version of the corpus
> 7) Users can improve the corpus and release modified versions without
> special permission
> 8) Free of cost to academia
> 9) Free of cost to commercial users
> 10) Corpus released under a standard free license of some sort for
> straightforward compatibility with other projects (GPL, LGPL, CC, etc.)
>
> The latest version of our corpus, IcePaHC, preview version 0.3, with
> 262,000 words, is available for download as described in the announcement
> below.
>
> -----------
>
> Available: Icelandic Parsed Historical Corpus, V0.3
>
> We are pleased to announce that version 0.3 of the Icelandic Parsed
> Historical Corpus (IcePaHC) is now available for free download.
>
> The corpus is syntactically parsed, annotated for full phrase structure
> using an adaptation of the annotation scheme used by the Penn parsed
> corpora of historical English (http://www.ling.upenn.edu/hist-corpora/)
> and other corpora in that tradition (see links from website). The corpus
> contains ca. 262,000 words from every century between the 12th and the
> 19th centuries inclusive. Please note that this is about a quarter of
> the ultimate goal for the completed corpus, ca. 1 million words.
>
> The corpus is distributed as raw UTF-8 data in labeled bracketing format
> and it is therefore compatible with various existing programs, including
> CorpusSearch (http://corpussearch.sourceforge.net/).
>
> The corpus can be downloaded from:
> http://www.linguist.is/icelandic_treebank/Download
>
> Further information on the annotation guidelines and project
> organization can be found on the project wiki:
> http://www.linguist.is/icelandic_treebank/
>
> We hope that this release will result in feedback that allows us to
> improve the resource for upcoming versions. Updates are released every
> three months - the upcoming 0.4 version will be released on April 4th
> 2011. Between releases, development can be tracked at our open
> repository on GitHub (http://github.com/antonkarl/icecorpus), but use of
> released versions is encouraged to ensure that results can be replicated.
>
> Texts included in Version 0.3:
> 4439 words from The First Grammatical Treatise (entire text) (12th century)
> 8179 words from Íslensk hómilíubók (Icelandic book of homilies) (12th
> century)
> 3459 words from Egils saga (theta fragment) (13th century)
> 22720 words from Sturlunga saga (13th century)
> 23040 words from Finnboga saga ramma (1350)
> 11486 words from Bandamanna saga (1450)
> 23041 words from Vilhjálms saga Sjóðs (1450)
> 8582 words from Erasmus saga (1525)
> 20683 words from the New Testament's Gospel of John (1540)
> 16421 words from the New Testament's Acts (1540)
> 17127 words from Ólafur Egilsson's travelogue (1628)
> 9760 words from Píslarsaga Jóns Magnússonar (1659)
> 22905 words from Jón Indíafari's travelogue (1661)
> 22099 words from Jón Steingrímsson's biography (1791)
> 3269 words from Jónas Hallgrímsson's essay on the nature and origin of
> the earth (1835)
> 17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
> 27192 words from Brynjólfur Sveinsson biskup (novel by Torfhildur Hólm)
> (1882)
> Total number of words: 262240
>
>
> Joel C. Wallenberg (joel.wallenberg at gmail.com)
> Anton Karl Ingason (anton.karl.ingason at gmail.com)
> Einar Freyr Sigurðsson (einarfs at gmail.com)
> Eiríkur Rögnvaldsson (eirikur at hi.is)
> University of Iceland
>
> The project is funded by the following grants:
>
> Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language
> Technology beyond English – Icelandic as a test case".
>
> U.S. National Science Foundation (NSF) International Research Fellowship
> Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a
> comparative study of grammatical change in Icelandic and English".
>

-- 
Karën FORT
Ingénieure/Engineer et/and doctorante/PhD student
INIST-CNRS / LIPN
2, allée de Brabois
54500 Vandoeuvre-lès-Nancy
France
Bureau/Office: H112
+33 (0)3 83 50 46 36

http://www-lipn.univ-paris13.fr/~fort/

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora