[Corpora-List] Using version control software in corpus construction

iain iain at idcl.co.uk
Mon Mar 29 13:00:55 UTC 2010


One thing to be aware of is that the diff tools that come with source-code
version control systems (as discussed in the replies) may not give good
results on text that is essentially XML-based.

For example, changing the structure of an XML file may seem like a simple
change to a human (moving a branch to another location), but a conventional
line-based diff will see it as a large-scale text edit, obscuring what is
actually happening. This also makes the deltas much larger than they need
to be, though that is scarcely a concern given how cheap modern storage is.
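
To make the point concrete, here is a rough sketch in Python (standard
library only). The sample documents, the "id" attribute and the helper names
(signature, subtree_locations, moved_subtrees, changed_line_count) are all
made up for illustration, and this is not how any particular XML diff tool
works. It moves one branch of a small XML file, counts the lines a line-based
diff flags, and then compares the two trees structurally instead.

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical sample data: <text id="a"> is moved from the top level
# into <section>, which also changes its indentation.
OLD = """<corpus>
  <text id="a">
    <s>One.</s>
    <s>Two.</s>
  </text>
  <section>
    <text id="b">
      <s>Three.</s>
    </text>
  </section>
</corpus>"""

NEW = """<corpus>
  <section>
    <text id="b">
      <s>Three.</s>
    </text>
    <text id="a">
      <s>One.</s>
      <s>Two.</s>
    </text>
  </section>
</corpus>"""


def changed_line_count(old_xml, new_xml):
    """Count the lines a line-based diff marks as removed or added."""
    diff = difflib.ndiff(old_xml.splitlines(), new_xml.splitlines())
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))


def signature(elem):
    """Whitespace-insensitive fingerprint of an element and its subtree.

    Simplification: tail text is ignored, so mixed content is not handled.
    """
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(signature(child) for child in elem),
    )


def subtree_locations(root):
    """Map each subtree fingerprint to the tag path of its parent.

    Simplification: if identical subtrees occur twice, the last one wins.
    """
    locations = {}

    def walk(elem, path):
        for child in elem:
            locations[signature(child)] = path
            walk(child, path + "/" + child.tag)

    walk(root, root.tag)
    return locations


def moved_subtrees(old_xml, new_xml):
    """List subtrees that exist unchanged in both documents but under a different parent."""
    old_root = ET.fromstring(old_xml)
    new_locations = subtree_locations(ET.fromstring(new_xml))
    moves = []

    def walk(elem, path):
        for child in elem:
            sig = signature(child)
            if sig in new_locations:
                if new_locations[sig] != path:
                    moves.append((child.tag, child.attrib.get("id"), path, new_locations[sig]))
                # An identical subtree exists in the new file: no need to descend.
            else:
                walk(child, path + "/" + child.tag)

    walk(old_root, old_root.tag)
    return moves


if __name__ == "__main__":
    print("lines flagged by line diff:", changed_line_count(OLD, NEW))
    print("structural moves:", moved_subtrees(OLD, NEW))
```

The line diff flags every line of the moved branch as both a deletion and an
addition, whereas the structural comparison reports a single move of
<text id="a"> from corpus to corpus/section.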

There are a number of XML-aware diff tools, which work with varying degrees
of success.


Iain

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Hardie, Andrew
Sent: 28 March 2010 16:21
To: corpora at uib.no
Subject: [Corpora-List] Using version control software in corpus
construction

Hi all,

I am contemplating using a source-code version control system (such as
Subversion) to store the files of a corpus as it is being constructed,
(a) to help keep track of changes as I go, (b) to allow several people
to work on it in a non-confusing way and (c) to simplify backing up and
aid data security.

Using version control software occurred to me after spending some time
manually keeping track of a set of encoding and markup changes in an
older corpus, and finding it a total pain in the neck. Of course, this
is not exactly what version control software is designed for...

I was wondering, has anyone on the list done this before? If so, are
there any pitfalls to avoid / particular pointers I should be aware of?
Or alternative (better) ways of accomplishing the same thing?

All hints and tips gratefully received.

Best

Andrew.



Andrew Hardie
Department of Linguistics
County South
Lancaster University
Lancaster LA1 4YL
United Kingdom
 
a.hardie at lancaster.ac.uk

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

