[Corpora-List] free tagged corpus

Martin Wynne martin.wynne at oucs.ox.ac.uk
Thu Nov 17 11:13:28 UTC 2005


Dear Delip,

It depends on what you mean by 'freely available'. This has (at least) 
two meanings in this context. It can mean free of cost, or it can mean 
free of legal or ethical restrictions on its use.

Many corpora do not cost money to use, although the ones mentioned 
so far in this thread, such as the BNC and resources from the LDC, do 
cost money.

As for legal and ethical restrictions, it may be useful to look at what 
they say in the world of software, where several levels of freedom can 
be differentiated:

     * The freedom to run the program, for any purpose (freedom 0).
     * The freedom to study how the program works, and adapt it to your 
needs (freedom 1). Access to the source code is a precondition for this.
     * The freedom to redistribute copies so you can help your neighbor 
(freedom 2).
     * The freedom to improve the program, and release your improvements 
to the public, so that the whole community benefits (freedom 3). Access 
to the source code is a precondition for this.

(from http://www.gnu.org/philosophy/free-sw.html)

With corpora, a parallel classification may be possible:

     * The freedom to access and analyse the corpus (freedom 0).
     * The freedom to run your own tools on the corpus, and adapt it to 
your needs (freedom 1). Access to the full text of the corpus is a 
precondition for this.
     * The freedom to redistribute copies so you can help your neighbor 
(freedom 2).
     * The freedom to add texts or metadata or annotations, and release 
your improvements to the public, so that the whole community benefits 
(freedom 3).

In most cases, any of the above freedoms may be restricted to academic 
or non-commercial research, though the precise terms of these 
restrictions vary, and the boundaries of 'non-commercial' may not be 
easy to draw.

Usually a corpus creator cannot simply release a corpus under terms of 
their choosing, allowing whichever of the above freedoms they want to, 
because they don't own the rights over all of the texts contained in the 
corpus. A corpus usually contains texts written or spoken by various 
people, and these people, or publishers, or employers, or others, are 
likely to have intellectual property rights over these texts. 
(Furthermore, the corpus builders acquire rights over the 
collection, but these may reside not in the individuals but in their 
institution or funders.) To complicate things further, the relevant laws 
relating to these rights vary in different countries, and have varied 
over time.

My colleague Lou Burnard asked a similar question on this list in 
January this year. You can see the start of the thread in the archive at
http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0501&L=CORPORA&D=0&I=-3&P=13048
He was surprised to find virtually nothing which could be distributed 
under something like an open source software licence.

The simplest answer to this is that you have to say a bit more precisely 
what it is you want to be free to do with the corpus, and then maybe 
you'll get some more answers.

Best wishes,
Martin


Delip Rao wrote:
> Hello All,
> 
> Is there any freely available part-of-speech tagged
> corpus for research/non-commercial use?
> 
> Thanks,
> Delip Rao
> -----------
> AIDB LAB,
> IIT MADRAS


-- 
Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk


