[Corpora-List] SourceForge as a corpus

Fri Jan 25 12:13:04 UTC 2008

As an earlier post noted, the Oxford Text Archive does indeed archive, 
preserve and distribute language corpora, as well as many other types of 
literary and linguistic resources.

You can see what corpora we have at:

http://www.ota.ox.ac.uk/search/search.perl?search=QUICK&misc=corpus

(You can always see this list by searching for 'corpus' from our home page.)

Our current funding as part of the Arts and Humanities Data Service in 
the UK is being cut from the end of March 2008, and while we will 
continue to operate, we will no longer be able to offer a free service 
for corpus builders to deposit, although the corpora will still be free 
to download and use. Within the next few weeks we will publish a 
charging policy based on the costs of ongoing data preservation. We will 
also be looking for further funding that might allow us to return to 
offering a service free to depositors in the future.

As far as licensing is concerned, we currently use our own OTA licence, 
which allows educational use but prohibits re-distribution. We are, 
however, keen to move to an appropriate Creative Commons or similar 
licence to allow more open access and re-use.

We don't currently have facilities for self-archiving, or for users 
to add to or enhance resources, although users are welcome to take 
corpora, enhance them and then re-deposit them. Another possible future 
route for storing and sharing corpora will be the new Google system for 
archiving research data, to be launched shortly, which was recently 
reported here:

http://blog.wired.com/wiredscience/2008/01/google-to-provi.html

What Google and SourceForge won't provide are specialist services for 
language resources, which can include:
- advice on resource creation
- model licences tailored for language resources
- advice on metadata standards
- expert curation of the data, including data migration and metadata updates
- enhanced access services for online querying of the data
- shared resource discovery with other language resource archives
- shared access & authorisation
- distributed processing (e.g. querying data across more than one archive).

Various initiatives involving existing repositories and centres of 
expertise are aiming to improve or build these services, including OLAC, 
which is focussed on resource discovery, and CLARIN, a new pan-European 
initiative to build a language resources infrastructure.

Martin

-- 
Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk

Francis Tyers wrote:
> On Thu, 24-01-2008 at 23:43 +0100, Klaus Guenther wrote:
>
>> In order for this to work, I'd like to see a license agreement for open 
>> corpora. We could start with Creative Commons licensing, and move on to 
>> a license unique to open corpora. In addition, there could be several 
>> commercial corpus licenses. The days where written corpora were 
>> expensive to create are mostly over. We're seeing new corpora based on 
>> web data arising, including monitor corpora such as the one that 
>> accompanies the ANC. 
>> If the GPL works for FOSS, we can also work something out that works for 
>> corpora. By all pulling together and collaborating to create useful 
>> corpora, we can create a new frontier in linguistics.
>
> Ideally the corpora would be dual licensed under a Creative Commons
> Licence _and_ the GPL. This would allow corpora, or parts of corpora, to
> be easily distributed _and packaged_¹ with GPL software, for example
> for the purposes of training. My particular preference would actually be
> triple licensing under:
>
> * Creative Commons BY-SA (3.0 or later)
> * GPL (v2 or later)
> * GFDL (with no invariant sections) 
>
> This allows licence compatibility with free software (GPL), free
> software documentation (often GFDL) and other open content (increasingly
> CC). 
>
> The easier option of course would be to make everything public domain
> (or equivalent in countries without that concept) and then let people
> choose their own licence for derivatives.
>
> Thanks, from licence purgatory,
>
> Fran
>
> ¹ One of the biggest GNU/Linux distributions, Debian, currently
> considers variants of the GFDL with invariant sections, and all CC
> licences other than CC-BY-SA 3.0, to be non-free.
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   
