[Corpora-List] The most free and open parsed corpus possible?

Anton Karl Ingason anton.karl.ingason at gmail.com
Tue Jan 11 14:20:10 UTC 2011


Thank you for the suggestion. We will look into the LGPL-LR.

I would also like to clarify that we do explicitly state the LGPL license in
a readme file and at the beginning of all data files in the corpus (in
addition to our website and announcements), and the full GPL/LGPL text is
included in the root directory of the distribution.

Anton

On Tue, Jan 11, 2011 at 10:24 AM, Karen Fort <karen.fort at inist.fr> wrote:

> Hi,
>
> If it's a corpus, then you should have a look at the LGPL-LR (for
> Linguistic Resource) licence.
>
> Also note that without an explicit mention of the license (for example
> a license.txt file), the most restrictive rights apply to the corpus.
>
> Hope this helps,
>
> Karen
>
> On 10/01/2011 18:31, Anton Karl Ingason wrote:
>
>> Numerous arguments exist for the benefits of free and open resources. In
>> our corpus project, the Icelandic Parsed Historical Corpus (IcePaHC),
>> one of our goals is to identify how we can make the most out of these
>> benefits and compare our approach to the approaches that others have
>> taken with their parsed corpora (the same issues will of course in many
>> cases apply equivalently to other types of resources). Our goal is not
>> to "win the competition of the most free parsed corpus", but rather to
>> learn what steps one might take to maximize the benefits of such an
>> approach, while doing our best to carry out these steps in the context
>> of our project.
>>
>> Below is a list of steps we decided to pursue to this end.
>> We would like to ask Corpora List:
>> - Are there other concrete steps that we should state explicitly in
>> order to achieve our goal?
>> - Do you disagree with some of the steps?
>> - What is the situation for other parsed corpora with regard to the
>> steps we list? In particular it would be useful to get a
>> "yes/no/comment" for each item on the list for a particular corpus
>> and/or a reference to a paper/website that can be cited for that
>> information.
>>
>> The steps we have taken with IcePaHC:
>> 1) Raw data can be downloaded for local use (corpus not hidden
>> behind a search interface)
>> 2) Comprehensive documentation freely available online
>> 3) Available without registration, user identification of some sort, or
>> signing of contracts
>> 4) Development process of corpus relies only on free/open source
>> software tools (for transparent replication of annotation process)
>> 5) Open development (annotation is carried out in an open online version
>> control repository for transparency regarding the actual steps taken in
>> the development and immediate access to work-in-progress)
>> 6) Regular scheduled releases of numbered versions during development as
>> well as for more permanent milestone versions so that researchers can
>> always produce replicable results on a recent version of the corpus
>> 7) Users can improve the corpus and release modified versions without
>> special permission
>> 8) Free of cost to academia
>> 9) Free of cost to commercial users
>> 10) Corpus released under a standard free license of some sort for
>> straightforward compatibility with other projects (GPL, LGPL, CC, etc.)
>>
>> The latest version of our corpus, IcePaHC, preview version 0.3, with
>> 262,000 words, is available for download as described in the announcement
>> below.
>>
>> -----------
>>
>> Available: Icelandic Parsed Historical Corpus, V0.3
>>
>> We are pleased to announce that version 0.3 of the Icelandic Parsed
>> Historical Corpus (IcePaHC) is now available for free download.
>>
>> The corpus is syntactically parsed, annotated for full phrase structure
>> using an adaptation of the annotation scheme used by the Penn parsed
>> corpora of historical English (http://www.ling.upenn.edu/hist-corpora/)
>> and other corpora in that tradition (see links from website). The corpus
>> contains ca. 262,000 words from every century between the 12th and the
>> 19th centuries inclusive. Please note that this is about a quarter of
>> the ultimate goal for the completed corpus, ca. 1 million words.
>>
>> The corpus is distributed as raw UTF-8 data in labeled bracketing format
>> and it is therefore compatible with various existing programs, including
>> CorpusSearch (http://corpussearch.sourceforge.net/).
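
For readers unfamiliar with labeled bracketing, the following is a minimal Python sketch of how such a file could be read without special tooling. It assumes Penn-style trees of the form (LABEL child child ...). The sample tree, its tags, and the helper names parse_bracketed and leaves are illustrative assumptions, not taken from IcePaHC or from CorpusSearch.

```python
import re


def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested (label, children) tuples.

    Assumes Penn-style input: "(LABEL child child ...)", where each child is
    either another bracketed node or a bare word token.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the word tokens of a parsed tree, left to right."""
    _label, children = tree
    out = []
    for child in children:
        if isinstance(child, tuple):
            out.extend(leaves(child))
        else:
            out.append(child)
    return out


# Hypothetical example tree, not real corpus data.
sample = "(IP-MAT (NP-SBJ (PRO I)) (VBD saw) (NP-OB1 (D the) (N ship)) (. .))"
tree = parse_bracketed(sample)
words = leaves(tree)
```

A sketch like this only recovers tree structure and terminal tokens. Anything specific to the corpus, such as ID nodes, traces, or how punctuation is counted as words, would have to follow the project's own annotation guidelines.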
>>
>> The corpus can be downloaded from:
>> http://www.linguist.is/icelandic_treebank/Download
>>
>>
>> Further information on the annotation guidelines and project
>> organization can be found on the project wiki:
>> http://www.linguist.is/icelandic_treebank/
>>
>>
>> We hope that this release will result in feedback that allows us to
>> improve the resource for upcoming versions. Updates are released every
>> three months - the upcoming 0.4 version will be released on April 4th
>> 2011. Between releases, development can be tracked at our open
>> repository at Github (http://github.com/antonkarl/icecorpus) but use of
>> released versions is encouraged to ensure that results can be replicated.
>>
>> Texts included in Version 0.3:
>> 4439 words from The First Grammatical Treatise (entire text) (12th
>> century)
>> 8179 words from Íslensk hómilíubok (Icelandic book of homilies) (12th
>> century)
>> 3459 words from Egils saga (theta fragment) (13th century)
>> 22720 words from Sturlunga saga (13th century)
>> 23040 words from Finnboga saga ramma (1350)
>> 11486 words from Bandamanna saga (1450)
>> 23041 words from Vilhjálms saga Sjóðs (1450)
>> 8582 words from Erasmus saga (1525)
>> 20683 words from the New Testament's Gospel of John (1540)
>> 16421 words from the New Testament's Acts (1540)
>> 17127 words from Ólafur Egilsson's travelogue (1628)
>> 9760 words from Píslarsaga Jóns Magnússonar (1659)
>> 22905 words from Jón Indíafari's travelogue (1661)
>> 22099 words from Jón Steingrímsson's biography (1791)
>> 3269 words from Jónas Hallgrímsson's essay on the nature and origin of
>> the earth (1835)
>> 17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
>> 27192 words from Brynjólfur Sveinsson biskup (novel by Torfhildur Hólm)
>> (1882)
>> Total number of words: 262240
>>
>>
>> Joel C. Wallenberg (joel.wallenberg at gmail.com)
>> Anton Karl Ingason (anton.karl.ingason at gmail.com)
>> Einar Freyr Sigurðsson (einarfs at gmail.com)
>> Eiríkur Rögnvaldsson (eirikur at hi.is)
>>
>> University of Iceland
>>
>> The project is funded by the following grants:
>>
>> Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language
>> Technology beyond English – Icelandic as a test case".
>>
>> U.S. National Science Foundation (NSF) International Research Fellowship
>> Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a
>> comparative study of grammatical change in Icelandic and English".
>>
>>
> --
> Karën FORT
> Ingénieure/Engineer et/and doctorante/PhD student
> INIST-CNRS / LIPN
> 2, allée de Brabois
> 54500 Vandoeuvre-lès-Nancy
> France
> Bureau/Office: H112
> +33 (0)3 83 50 46 36
>
> http://www-lipn.univ-paris13.fr/~fort/
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
www.linguist.is
s: 846 2613 / tel: +354 846 2613