[Corpora-List] Using version control software in corpus construction

Piotr Bański bansp at o2.pl
Sun Mar 28 17:11:44 UTC 2010


One thing that version control gives you that has not been mentioned so
far is that it makes it easy to define the state of the corpus as it was
at the moment you performed calculations that you want to be
reproducible. Before you perform any measurements, tag the current
corpus as a 'development snapshot', and it will always be possible to go
back to it later. This applies both to dynamic/monitor corpora and to
static corpora before any corrections are made to their data and/or
annotations.
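
As a minimal sketch of such tagging, assuming a Subversion repository
with the conventional trunk/ and tags/ layout (the repository URL, the
tag name and the helper names below are hypothetical, made up for
illustration only), something like the following Python could be run
before any measurements are taken:

```python
import subprocess
from typing import List


def snapshot_command(repo_url: str, tag_name: str, message: str) -> List[str]:
    """Build the `svn copy` command that tags trunk as a named snapshot.

    Assumes the conventional layout: <repo_url>/trunk and <repo_url>/tags.
    """
    base = repo_url.rstrip("/")
    return [
        "svn", "copy",
        f"{base}/trunk",             # current state of the corpus
        f"{base}/tags/{tag_name}",   # the snapshot to return to later
        "-m", message,               # log message recorded with the tag
    ]


def tag_snapshot(repo_url: str, tag_name: str, message: str,
                 dry_run: bool = True) -> List[str]:
    """Return the tagging command; execute it only when dry_run is False."""
    cmd = snapshot_command(repo_url, tag_name, message)
    if not dry_run:
        # Requires the `svn` client and write access to the repository.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical repository URL and tag name.
    cmd = tag_snapshot(
        "https://svn.example.org/mycorpus",
        "snapshot-2010-03-28",
        "Development snapshot taken before frequency counts",
    )
    print(" ".join(cmd))
```

With dry_run=True the command is only built and printed, not executed.
In Subversion a tag is a cheap copy, so taking one before each round of
measurements costs very little.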

I credit the observation concerning the usefulness (or actually virtual
necessity, if empiricism is treated seriously) of 'snapshots' to Henry
S. Thompson in a conference discussion earlier this year (though it
may/must have been around for some time, I hope...). I'm not sure that
he meant this in the sense of 'SVN/CVS/whatnot release tags', but
translating it into version-control-speak is a trivial extension of that
observation.

Best,

  Piotr

On 2010-03-28 17:20, Hardie, Andrew wrote:
> Hi all,
> 
> I am contemplating using a source-code version control system (such as
> Subversion) to store the files of a corpus as it is being constructed,
> (a) to help keep track of changes as I go, (b) to allow several people
> to work on it in a non-confusing way and (c) to simplify backing up and
> aid data security.
> 
> Using version control software occurred to me after spending some time
> manually keeping track of a set of encoding and markup changes in an
> older corpus, and finding it a total pain in the neck. Of course, this
> is not exactly what version control software is designed for...
> 
> I was wondering, has anyone on the list done this before? If so, are
> there any pitfalls to avoid / particular pointers I should be aware of?
> Or alternative (better) ways of accomplishing the same thing?
> 
> All hints and tips gratefully received.
> 
> Best
> 
> Andrew.
> 
> 
> 
> Andrew Hardie
> Department of Linguistics
> County South
> Lancaster University
> Lancaster LA1 4YL
> United Kingdom
>  
> a.hardie at lancaster.ac.uk
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 




