[Corpora-List] Using version control software in corpus construction

Joerg Tiedemann jorg.tiedemann at lingfil.uu.se
Sun Mar 28 18:19:17 UTC 2010


Funny coincidence: we have also just started discussing the use of version 
control systems for a newly started project on data sharing and model 
building for machine translation (http://www.letsmt.eu/). Managing 
revisions seems very useful for such a collaborative initiative. 
However, we haven't started implementing our repository yet, and we would 
also like to hear about any experience with large-scale data files in 
SVN or related systems. Here are some questions we would like to answer:


* Is it possible to compress internal files in SVN or other systems? 
(What I mean is that SVN would take care of compression of the internal 
files in the repository but check-in/check-out works with plain text files)

* Is it possible to remove specific revisions or even to restrict the 
history to a specific number of revisions? (but I'm not sure if this 
would be a good idea anyway)

* How efficient is check-in/check-out for large repositories/files?


Any insights/hints (also about other issues) would be much appreciated.


Jörg


PS: We will be looking for (data) contributions soon ....


On 3/28/10 7:14 PM, Piotr Bański wrote:
> One thing that version control gives you that has not been mentioned so
> far is that it makes it easy to define the state of the corpus as it was
> at the moment you performed calculations that you want to be
> reproducible. Before you perform any measurements, tag the current
> corpus as a 'development snapshot', and it will always be possible to go
> back to it later. This concerns both dynamic/monitor corpora as well as
> static corpora before any corrections are made to their data and/or
> annotations.
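
In Subversion terms, such a snapshot is usually a server-side copy of the
trunk into a tags directory. A minimal sketch of building that command,
assuming the conventional trunk/tags layout; the repository URL, the tag
naming scheme and the function name are hypothetical, and the command is
only printed, not executed:

```python
import shlex
from datetime import date


def snapshot_tag_command(repo_url, label, day=None):
    """Build (but do not run) an `svn copy` command that tags trunk.

    Assumes the conventional layout <repo>/trunk and <repo>/tags;
    the tag naming scheme is an illustrative choice.
    """
    day = day or date.today()
    tag_name = f"snapshot-{day.isoformat()}-{label}"
    base = repo_url.rstrip("/")
    return [
        "svn", "copy",
        f"{base}/trunk",             # current state of the corpus
        f"{base}/tags/{tag_name}",   # snapshot to return to later
        "-m", f"Development snapshot before {label}",
    ]


if __name__ == "__main__":
    # Hypothetical repository URL, for illustration only.
    cmd = snapshot_tag_command(
        "https://svn.example.org/corpus", "freq-counts", date(2010, 3, 28)
    )
    print(shlex.join(cmd))
```

Quoting the tag name next to any reported figures should then be enough to
identify the exact corpus state the measurements were made on.
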
>
> I credit the observation concerning the usefulness (or actually virtual
> necessity, if empiricism is treated seriously) of 'snapshots' to Henry
> S. Thompson in a conference discussion earlier this year (though it
> may/must have been around for some time, I hope...). I'm not sure that
> he meant this in the sense of 'SVN/CVS/whatnot release tags', but
> translating it into version-control-speak is a trivial extension of that
> observation.
>
> Best,
>
>    Piotr
>
> On 2010-03-28 17:20, Hardie, Andrew wrote:
>> Hi all,
>>
>> I am contemplating using a source-code version control system (such as
>> Subversion) to store the files of a corpus as it is being constructed,
>> (a) to help keep track of changes as I go, (b) to allow several people
>> to work on it in a non-confusing way and (c) to simplify backing up and
>> aid data security.
>>
>> Using version control software occurred to me after spending some time
>> manually keeping track of a set of encoding and markup changes in an
>> older corpus, and finding it a total pain in the neck. Of course, this
>> is not exactly what version control software is designed for...
>>
>> I was wondering, has anyone on the list done this before? If so, are
>> there any pitfalls to avoid / particular pointers I should be aware of?
>> Or alternative (better) ways of accomplishing the same thing?
>>
>> All hints and tips gratefully received.
>>
>> Best
>>
>> Andrew.
>>
>>
>>
>> Andrew Hardie
>> Department of Linguistics
>> County South
>> Lancaster University
>> Lancaster LA1 4YL
>> United Kingdom
>>
>> a.hardie at lancaster.ac.uk
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>


