[Corpora-List] free tagged corpus
Martin Wynne
martin.wynne at oucs.ox.ac.uk
Thu Nov 17 11:13:28 UTC 2005
Dear Delip,
It depends on what you mean by 'freely available'. This has (at least)
two meanings in this context. It can mean free of cost, or it can mean
free of legal or ethical restrictions on its use.
Many corpora do not cost money to use, although the ones mentioned
so far in this thread, such as the BNC and resources from the LDC, do
cost money.
As for legal and ethical restrictions, it may be useful to look at what
they say in the world of software, where several levels of freedom can
be differentiated:
* The freedom to run the program, for any purpose (freedom 0).
* The freedom to study how the program works, and adapt it to your
needs (freedom 1). Access to the source code is a precondition for this.
* The freedom to redistribute copies so you can help your neighbor
(freedom 2).
* The freedom to improve the program, and release your improvements
to the public, so that the whole community benefits (freedom 3). Access
to the source code is a precondition for this.
(from http://www.gnu.org/philosophy/free-sw.html)
With corpora, a parallel classification may be possible:
* The freedom to access and analyse the corpus (freedom 0).
* The freedom to run your own tools on the corpus, and adapt it to
your needs (freedom 1). Access to the full text of the corpus is a
precondition for this.
* The freedom to redistribute copies so you can help your neighbor
(freedom 2).
* The freedom to add texts or metadata or annotations, and release
your improvements to the public, so that the whole community benefits
(freedom 3).
In most cases, any of the above freedoms may be restricted to academic
or non-commercial research, though the precise terms of these
restrictions vary, and the boundaries of 'non-commercial' may not be
easy to draw.
Usually a corpus creator cannot simply release a corpus under terms of
their choosing, allowing whichever of the above freedoms they want to,
because they don't own the rights over all of the texts contained in the
corpus. A corpus usually contains texts written or spoken by various
people, and these people, or publishers, or employers, or others, are
likely to have intellectual property rights over these texts.
(Furthermore, the corpus builders acquire rights over the
collection, but these may reside not in the individuals but in their
institution or funders). To complicate things further, the relevant laws
relating to these rights vary in different countries, and have varied
over time.
My colleague Lou Burnard asked a similar question on this list in
January this year. You can see the start of the thread in the archive at
http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0501&L=CORPORA&D=0&I=-3&P=13048
He was surprised to find virtually nothing which could be distributed
under something like an open source software licence.
The simplest answer to this is that you have to say a bit more precisely
what it is you want to be free to do with the corpus, and then maybe
you'll get some more answers.
Best wishes,
Martin
Delip Rao wrote:
> Hello All,
>
> Is there any freely available part-of-speech tagged
> corpus for research/non-commercial use?
>
> Thanks,
> Delip Rao
> -----------
> AIDB LAB,
> IIT MADRAS
--
Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics
Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk