[Corpora-List] free tagged corpus

David Graff graff at ldc.upenn.edu
Thu Nov 17 15:03:26 UTC 2005


martin.wynne at oucs.ox.ac.uk said:
> With corpora, a parallel classification may be possible:
>
>      * The freedom to access and analyse the corpus (freedom 0).
>      * The freedom to run your own tools on the corpus, and adapt it to
> your needs (freedom 1). Access to the full text of the corpus is a
> precondition for this.
>      * The freedom to redistribute copies so you can help your neighbor
> (freedom 2).
>      * The freedom to add texts or metadata or annotations, and release
> your improvements to the public, so that the whole community benefits
> (freedom 3). 

Regarding "freedom 3" (the last point), there can be one important
difference between corpora and software.  For many kinds of corpus
research, it's possible to circulate metadata and annotations in
"stand-off" form: instead of including the corpus data with the
annotations, you include indexing information (file name, document ID,
byte offset, etc.) that cites a reference release of the corpus data.
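
To make the idea concrete, here is a minimal sketch in Python of what a
stand-off record and its resolution against the corpus might look like.
The names (StandoffAnnotation, resolve), the half-open byte-offset
convention, and the toy "doc001" text are all assumptions made up for
illustration; they do not describe any particular corpus release or
annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    """One stand-off record: it points into the corpus, it holds no corpus text."""
    doc_id: str  # document ID in the reference release (hypothetical naming)
    start: int   # byte offset of the first byte of the span
    end: int     # byte offset one past the last byte of the span
    label: str   # the annotation itself, e.g. a part-of-speech tag


def resolve(ann, corpus):
    """Look up the annotated text; `corpus` maps doc_id -> raw bytes."""
    raw = corpus[ann.doc_id]
    if not (0 <= ann.start <= ann.end <= len(raw)):
        # Offsets that fall outside the document usually mean the reader
        # holds a different release than the one the annotations cite.
        raise ValueError("offsets outside document; wrong corpus release?")
    return raw[ann.start:ann.end].decode("utf-8")


# Toy stand-in for corpus data that the reader already holds.
corpus = {"doc001": "The cat sat.".encode("utf-8")}

# This list is all the annotator would need to distribute.
annotations = [
    StandoffAnnotation("doc001", 0, 3, "DT"),
    StandoffAnnotation("doc001", 4, 7, "NN"),
    StandoffAnnotation("doc001", 8, 11, "VBD"),
]

tagged = [(resolve(a, corpus), a.label) for a in annotations]
print(tagged)
```

Note that the annotation records by themselves carry only IDs, offsets and
labels; the words are recovered only by someone who already has the
corpus bytes.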

Obviously, the only people who can make use of stand-off annotations are
those who already have or can get "freedom 1" (access to full text) for the
given corpus.  (Or maybe there are ways to make these annotations work for
people who only have "freedom 0"?)

In any case, many researchers can contribute to the community in this way,
and many others can benefit, without risking property-rights infringements:
given that the annotations do not contain a replication of the corpus,
ownership of the annotations (and the choice of whether/how to distribute
them) resides with the annotation creator, and is not limited in any direct
way by the distribution constraints of the corpus.

	David Graff


