[Corpora-List] Using version control software in corpus construction

Rob Malouf rmalouf at mail.sdsu.edu
Sun Mar 28 16:30:07 UTC 2010


We used version control while building the Alpino corpus/treebank.  It works very well as long as your data and annotations are stored in a text-ish format (like XML).  Version control doesn't work especially well with binary files -- it'll keep track of the latest versions, but it can't track or merge individual changes.
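
To make the text-vs-binary point concrete, here is a rough sketch of the kind of line-based comparison that version control tools rely on for text files. It uses Python's standard difflib module rather than any particular version control system, and the XML snippet and file names are made up for illustration (they are not taken from the Alpino data).

```python
import difflib

# Hypothetical annotation file before and after one edit (illustrative only).
old = """<sentence id="1">
  <node cat="np" word="corpus"/>
  <node cat="vp" word="grows"/>
</sentence>
""".splitlines(keepends=True)

new = """<sentence id="1">
  <node cat="np" word="corpus"/>
  <node cat="v" word="grows"/>
</sentence>
""".splitlines(keepends=True)

# A line-based diff isolates the one changed line. This is what makes it
# possible to record and merge individual changes in text-ish formats.
diff = list(difflib.unified_diff(old, new, fromfile="before.xml", tofile="after.xml"))

# Keep only the removed/added content lines, skipping the two file headers.
changed = [
    line for line in diff
    if line[:1] in "+-" and not line.startswith(("+++", "---"))
]

print("".join(diff))
```

With one annotation per line, a single relabelled node shows up as one removed line and one added line. A binary file has no such line structure, so the same edit would typically register only as "the file changed".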

We used CVS, but there are better options around now.  Subversion seems to be popular, though I'm intrigued by the distributed version control systems like git.  

--
Rob Malouf <rmalouf at mail.sdsu.edu>
Department of Linguistics and Asian / Middle Eastern Languages
San Diego State University

On Mar 28, 2010, at 8:20 AM, Hardie, Andrew wrote:

> Hi all,
> 
> I am contemplating using a source-code version control system (such as
> Subversion) to store the files of a corpus as it is being constructed,
> (a) to help keep track of changes as I go, (b) to allow several people
> to work on it in a non-confusing way and (c) to simplify backing up and
> aid data security.
> 
> Using version control software occurred to me after spending some time
> manually keeping track of a set of encoding and markup changes in an
> older corpus, and finding it a total pain in the neck. Of course, this
> is not exactly what version control software is designed for...
> 
> I was wondering, has anyone on the list done this before? If so, are
> there any pitfalls to avoid / particular pointers I should be aware of?
> Or alternative (better) ways of accomplishing the same thing?
> 
> All hints and tips gratefully received.
> 
> Best
> 
> Andrew.
> 
> 
> 
> Andrew Hardie
> Department of Linguistics
> County South
> Lancaster University
> Lancaster LA1 4YL
> United Kingdom
> 
> a.hardie at lancaster.ac.uk
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 

