[Corpora-List] Using version control software in corpus construction

liu chang liuchangjohn at gmail.com
Mon Mar 29 19:32:08 UTC 2010


On Mon, Mar 29, 2010 at 2:19 AM, Joerg Tiedemann
<jorg.tiedemann at lingfil.uu.se> wrote:
>
> This is funny, we also just started discussing the use of version control
> systems for a newly started project on data sharing and model building for
> machine translation (http://www.letsmt.eu/). Managing revisions seems to be
> very useful for such a collaborative initiative. However, we haven't started
> implementing our repository yet and we also would like to know about any
> experience with large-scale data files in SVN or related systems. Here are
> some questions we would like to answer:
>
>
> * Is it possible to compress internal files in SVN or other systems? (What I
> mean is that SVN would take care of compression of the internal files in the
> repository but check-in/check-out works with plain text files)

SVN uses zlib to compress all data in your repository. However, your
working copy is roughly twice the size of the current revision,
because SVN keeps a pristine copy of every checked-out file next to
the one you edit. It's quite safe to say that all other common modern
version control systems (git, Mercurial, ...) are more space efficient
than SVN, if that matters to you.
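
To get a rough feel for what zlib does to plain text, here is a minimal
Python sketch. The sample text is made up and far more repetitive than
a real corpus, so the sizes it prints are not representative; SVN also
stores deltas between revisions, which this does not model.

```python
import zlib

# Made-up sample standing in for a plain-text corpus file (an assumption,
# not real data); real corpora are less repetitive and compress less well.
sample = ("The quick brown fox jumps over the lazy dog .\n" * 2000).encode("utf-8")

compressed = zlib.compress(sample)       # default compression level
restored = zlib.decompress(compressed)   # round-trips to the original bytes

print(f"plain text: {len(sample)} bytes")
print(f"zlib:       {len(compressed)} bytes")
print(f"lossless:   {restored == sample}")
```

Running the same two calls on a file from your own corpus should give a
more honest estimate.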

> * Is it possible to remove specific revisions or even to restrict the
> history to a specific number of revisions? (but I'm not sure if this would
> be a good idea anyway)

As far as I know, the only way to do that is to dump an SVN repository
as a plain file, manually remove the revision data, and re-import the
edited file. Yikes.
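
For the simpler case of keeping only a contiguous range of revisions,
a variant of the dump-and-reload route might look like the Python
sketch below. It is not something we have run in production: the
function name, paths and revision numbers are hypothetical, and only
the svnadmin subcommands dump, create and load are real. Removing
individual revisions from the middle of the history would still mean
editing the dump by hand.

```python
import subprocess


def trim_history(old_repo, new_repo, first_rev, last_rev, dump_path,
                 run=subprocess.run):
    """Copy revisions first_rev..last_rev of old_repo into a fresh new_repo.

    `run` is injectable so the command sequence can be inspected without
    having Subversion installed.
    """
    if not 0 <= first_rev <= last_rev:
        raise ValueError("need 0 <= first_rev <= last_rev")

    dump_cmd = ["svnadmin", "dump", old_repo, "-r", f"{first_rev}:{last_rev}"]
    create_cmd = ["svnadmin", "create", new_repo]
    load_cmd = ["svnadmin", "load", new_repo]

    # svnadmin dump writes the dump stream to stdout; capture it in a file.
    with open(dump_path, "wb") as out:
        run(dump_cmd, stdout=out, check=True)

    # Create an empty repository to receive the trimmed history.
    run(create_cmd, check=True)

    # svnadmin load reads the dump stream from stdin.
    with open(dump_path, "rb") as src:
        run(load_cmd, stdin=src, check=True)

    return [dump_cmd, create_cmd, load_cmd]
```

Without --incremental, the first dumped revision contains the whole
tree as it stood at that point, so the new repository starts from a
complete state. Expect the revision numbers in the new repository to
differ from the old ones.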

> * How efficient is check-in/check-out for large repositories/files?

Despite what many people say on the internet, we found git to have
performance problems when you have many large files (>5 GB or so):
check-in and check-out require lots of memory and CPU time. SVN
seems to be more `chatty' than git in general, but has no problems
dealing with files of practically any size.

> Any insights/hints (also about other issues) would be much appreciated.

After toying with git for a while, our lab went back to SVN. One
reason is the performance problem with large files mentioned above;
the other is that SVN is the only one that lets you check out, work
on, and check in part of a repository. We found this to be very
valuable when you have a large repository and different members only
need to work on different parts.
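
As an illustration of the partial checkout, each member can point
`svn checkout` at the URL of the subdirectory they need rather than at
the repository root. A minimal Python sketch that builds such a
command; the repository URL, directory layout and helper name are
hypothetical:

```python
def partial_checkout_command(repo_url, subdir, target_dir):
    """Build an `svn checkout` command for one subdirectory of a repository."""
    url = repo_url.rstrip("/") + "/" + subdir.strip("/")
    return ["svn", "checkout", url, target_dir]


# Hypothetical repository URL and layout, for illustration only.
cmd = partial_checkout_command(
    "http://svn.example.org/corpus/trunk", "annotations/pos", "pos-wc"
)
print(" ".join(cmd))
# prints: svn checkout http://svn.example.org/corpus/trunk/annotations/pos pos-wc
```

Commits from the resulting working copy only touch that subtree.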

Hope that helps.
Liu Chang

> PS: We will be looking for (data) contributions soon ....
>
>
> On 3/28/10 7:14 PM, Piotr Bański wrote:
>>
>> One thing that version control gives you that has not been mentioned so
>> far is that it makes it easy to define the state of the corpus as it was
>> at the moment you performed calculations that you want to be
>> reproducible. Before you perform any measurements, tag the current
>> corpus as a 'development snapshot', and it will always be possible to go
>> back to it later. This concerns both dynamic/monitor corpora and
>> static corpora before any corrections are made to their data and/or
>> annotations.
>>
>> I credit the observation concerning the usefulness (or actually virtual
>> necessity, if empiricism is treated seriously) of 'snapshots' to Henry
>> S. Thompson in a conference discussion earlier this year (though it
>> may/must have been around for some time, I hope...). I'm not sure that
>> he meant this in the sense of 'SVN/CVS/whatnot release tags', but
>> translating it into version-control-speak is a trivial extension of that
>> observation.
>>
>> Best,
>>
>>   Piotr
>>
>> On 2010-03-28 17:20, Hardie, Andrew wrote:
>>>
>>> Hi all,
>>>
>>> I am contemplating using a source-code version control system (such as
>>> Subversion) to store the files of a corpus as it is being constructed,
>>> (a) to help keep track of changes as I go, (b) to allow several people
>>> to work on it in a non-confusing way and (c) to simplify backing up and
>>> aid data security.
>>>
>>> Using version control software occurred to me after spending some time
>>> manually keeping track of a set of encoding and markup changes in an
>>> older corpus, and finding it a total pain in the neck. Of course, this
>>> is not exactly what version control software is designed for...
>>>
>>> I was wondering, has anyone on the list done this before? If so, are
>>> there any pitfalls to avoid / particular pointers I should be aware of?
>>> Or alternative (better) ways of accomplishing the same thing?
>>>
>>> All hints and tips gratefully received.
>>>
>>> Best
>>>
>>> Andrew.
>>>
>>>
>>>
>>> Andrew Hardie
>>> Department of Linguistics
>>> County South
>>> Lancaster University
>>> Lancaster LA1 4YL
>>> United Kingdom
>>>
>>> a.hardie at lancaster.ac.uk
>>>
>>> _______________________________________________
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>