[Corpora-List] Using version control software in corpus construction

Myrosia Dzikovska mdzikovs at inf.ed.ac.uk
Mon Mar 29 15:10:40 UTC 2010


We have built several corpora under either CVS or Subversion control (not 
public domain yet), using NXT, an XML-based, stand-off corpus format (with 
a query language that allows for easy querying later):

http://groups.inf.ed.ac.uk/nxt/index.shtml

We found that in general it works well, but there are several things to be 
careful about.

First, for us a stand-off corpus format was essential, because diff (used by 
CVS and SVN) is not aware of XML structure, and if two people work on the 
same file, the results can be unpredictable. The worst problems happened when 
CVS thought it had merged things safely, and we ended up with duplicated 
nodes because annotators were moving things around (annotation involved 
splitting and merging segment topic annotations). With stand-off, it was 
mostly safe to edit two different types of annotations in parallel.
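
As an illustration of the duplicated-node problem, here is a minimal sketch of 
a check that could be run after a merge. It is not part of NXT; the attribute 
name "id", the element names and the sample data are all assumptions made up 
for the example, so adapt them to whatever your format actually uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text, id_attr="id"):
    """Return the sorted id values that appear on more than one element.

    id_attr is an assumed attribute name; change it to match your corpus.
    """
    root = ET.fromstring(xml_text)
    counts = Counter(
        elem.get(id_attr) for elem in root.iter() if elem.get(id_attr) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical merged file in which a textual merge duplicated segment s2.
MERGED = """<topics>
  <segment id="s1"/>
  <segment id="s2"/>
  <segment id="s2"/>
</topics>"""

print(duplicate_ids(MERGED))  # ['s2']
```

A merge that diff considers clean can still produce a file like the one above, 
so a check of this kind is best run on every file touched by an update.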

The one danger of stand-off formats proved to be broken links. If file A 
points to file B, and one annotator changes A while another changes B in 
parallel, there is a risk that you will end up with broken links from A to 
B; CVS merges didn't always do the right thing.
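
A rough sketch of a link check along these lines follows. It assumes a 
simplified, hypothetical link syntax (an "href" attribute of the form 
"file.xml#id") rather than the real NXT pointer syntax, and the file names and 
contents are invented for the example.

```python
import xml.etree.ElementTree as ET


def broken_links(files, id_attr="id", link_attr="href"):
    """Return (source file, link) pairs whose target id does not exist.

    files maps a file name to its XML text. The attribute names and the
    "file.xml#id" link form are assumptions made for this sketch.
    """
    # Collect the ids defined in each file.
    ids = {
        name: {e.get(id_attr) for e in ET.fromstring(text).iter() if e.get(id_attr)}
        for name, text in files.items()
    }
    broken = []
    for name, text in files.items():
        for elem in ET.fromstring(text).iter():
            link = elem.get(link_attr)
            if link is None:
                continue
            target_file, _, target_id = link.partition("#")
            # A link is broken if the file is unknown or the id is missing.
            if target_id not in ids.get(target_file, set()):
                broken.append((name, link))
    return broken


# Hypothetical state after a merge: A.xml still points at w9, which was
# removed from B.xml by another annotator.
CORPUS = {
    "A.xml": '<topics><topic id="t1" href="B.xml#w1"/>'
             '<topic id="t2" href="B.xml#w9"/></topics>',
    "B.xml": '<words><w id="w1">hello</w><w id="w2">world</w></words>',
}

print(broken_links(CORPUS))  # [('A.xml', 'B.xml#w9')]
```

Running something like this before each commit would catch dangling pointers 
while the annotator who caused them still remembers what changed.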

But even with those reservations, as long as we were disciplined about making 
sure that related annotation files were not being changed at the same time, 
using CVS was very helpful: we were able to do different annotations in 
parallel, and to recover the state of the corpus at different times, based 
on tags, in order to replicate experiments.

Myrosia

On Sunday 28 March 2010 16:20, Hardie, Andrew wrote:
> Hi all,
>
> I am contemplating using a source-code version control system (such as
> Subversion) to store the files of a corpus as it is being constructed,
> (a) to help keep track of changes as I go, (b) to allow several people
> to work on it in a non-confusing way and (c) to simplify backing up and
> aid data security.
>
> Using version control software occurred to me after spending some time
> manually keeping track of a set of encoding and markup changes in an
> older corpus, and finding it a total pain in the neck. Of course, this
> is not exactly what version control software is designed for...
>
> I was wondering, has anyone on the list done this before? If so, are
> there any pitfalls to avoid / particular pointers I should be aware of?
> Or alternative (better) ways of accomplishing the same thing?
>
> All hints and tips gratefully received.
>
> Best
>
> Andrew.
>
>


