[Corpora-List] Using version control software in corpus construction
Myrosia Dzikovska
mdzikovs at inf.ed.ac.uk
Mon Mar 29 15:10:40 UTC 2010
We have built several corpora under either CVS or Subversion control (not
public yet), using NXT, an XML-based, stand-off corpus format (with a query
language that allows for easy querying later):
http://groups.inf.ed.ac.uk/nxt/index.shtml
We found that in general it works well, but there are several things to be
careful about.
First, for us a stand-off corpus format was essential, because diff (used by
CVS and svn) is line-based and not aware of XML structure, so if two people
work on the same file, the results can be unpredictable. The worst problems
happened when CVS
thought it merged things safely, and we ended up with duplicated nodes
because annotators were moving things around (annotation involved splitting
and merging segment topic annotations). With stand-off, it was mostly safe to
edit two different types of annotations in parallel.
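For illustration, a minimal sketch of a post-merge check for the duplicated
nodes described above. It is not part of NXT or CVS; it assumes only that each
annotation element carries a unique attribute whose local name is "id", and
the sample XML is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(name):
    """Strip a '{namespace}' prefix from an ElementTree attribute name."""
    return name.rsplit("}", 1)[-1]


def duplicate_ids(xml_text):
    """Return the sorted list of id values that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        value
        for elem in root.iter()
        for attr, value in elem.attrib.items()
        if local_name(attr) == "id"  # assumption: ids live in an "id" attribute
    )
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical result of a textual merge that kept both copies of segment s2.
merged = """<segments>
  <segment id="s1"/>
  <segment id="s2"/>
  <segment id="s2"/>
</segments>"""

print(duplicate_ids(merged))  # prints ['s2']
```

Running something like this after every update or merge would flag a file that
the version control system considers cleanly merged but that no longer has
unique nodes.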
The one danger of stand-off formats proved to be broken links. If file A
points to file B, and one annotator changes A while another changes B in
parallel, there is a risk that you will end up with broken links from A to B;
CVS merges didn't always do the right thing.
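A broken-link check of the kind this calls for could look roughly like the
sketch below. The link syntax href="file.xml#id(target)" and the "id"
attribute are simplifying assumptions made for the example, and the file names
and contents are hypothetical; range links or other pointer forms are not
handled and would simply be reported as unresolved.

```python
import re
import xml.etree.ElementTree as ET

# Assumed link syntax: href="other-file.xml#id(target_id)"
LINK = re.compile(r"^(?P<file>[^#]+)#id\((?P<id>[^)]+)\)$")


def local_name(name):
    """Strip a '{namespace}' prefix from an ElementTree attribute name."""
    return name.rsplit("}", 1)[-1]


def collect_ids(root):
    """Collect every id value defined anywhere in one parsed file."""
    return {
        value
        for elem in root.iter()
        for attr, value in elem.attrib.items()
        if local_name(attr) == "id"
    }


def broken_links(files):
    """files maps file name -> XML text.

    Return (source_file, href) pairs whose target file or id is missing,
    or whose href does not match the assumed syntax.
    """
    roots = {name: ET.fromstring(text) for name, text in files.items()}
    ids = {name: collect_ids(root) for name, root in roots.items()}
    broken = []
    for name, root in roots.items():
        for elem in root.iter():
            for attr, value in elem.attrib.items():
                if local_name(attr) != "href":
                    continue
                m = LINK.match(value)
                if m is None or m["id"] not in ids.get(m["file"], set()):
                    broken.append((name, value))
    return broken


# Hypothetical state after a merge: A.xml still points at w2,
# but w2 was removed from B.xml by another annotator.
corpus = {
    "A.xml": '<topics><topic id="t1">'
             '<child href="B.xml#id(w1)"/><child href="B.xml#id(w2)"/>'
             '</topic></topics>',
    "B.xml": '<words><w id="w1"/></words>',
}

print(broken_links(corpus))  # prints [('A.xml', 'B.xml#id(w2)')]
```

In practice the dictionary would be filled by reading the files of a working
copy, and the check would be run before committing.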
But even with those reservations, as long as we were disciplined about making
sure that related annotation files were not changed at the same time, using
CVS was very helpful - we were able to do different annotations in parallel,
and were able to recover the state of the corpus at different times, based on
tags, in order to replicate experiments.
Myrosia
On Sunday 28 March 2010 16:20, Hardie, Andrew wrote:
> Hi all,
>
> I am contemplating using a source-code version control system (such as
> Subversion) to store the files of a corpus as it is being constructed,
> (a) to help keep track of changes as I go, (b) to allow several people
> to work on it in a non-confusing way and (c) to simplify backing up and
> aid data security.
>
> Using version control software occurred to me after spending some time
> manually keeping track of a set of encoding and markup changes in an
> older corpus, and finding it a total pain in the neck. Of course, this
> is not exactly what version control software is designed for...
>
> I was wondering, has anyone on the list done this before? If so, are
> there any pitfalls to avoid / particular pointers I should be aware of?
> Or alternative (better) ways of accomplishing the same thing?
>
> All hints and tips gratefully received.
>
> Best
>
> Andrew.
>
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora