[Corpora-List] Using version control software in corpus construction

E.Y.Kow at brighton.ac.uk
Mon Mar 29 13:05:47 UTC 2010


On Sun, Mar 28, 2010 at 16:20:50 +0100, Hardie, Andrew wrote:
> I am contemplating using a source-code version control system (such as
> Subversion) to store the files of a corpus as it is being constructed,
> (a) to help keep track of changes as I go, (b) to allow several people
> to work on it in a non-confusing way and (c) to simplify backing up and
> aid data security.

Disclaimer: my commentary is biased by my work on the Darcs project.

I can't comment on the corpus-construction aspects of this thread, but
I'd like to point out that revision control is easier with a
*distributed* revision control system such as Git/Mercurial/Darcs rather
than a centralised one such as CVS/SVN.

The key point about a distributed version control system (DVCS) is
that it gives you greater liberty to change your mind.  A DVCS offers you:

  - Lower static inertia: getting started with a DVCS is super-easy;
    no thinking about where to put your repository.  Just 'git init'
    and you're done.

  - Peace of mind: if the central server goes down or is temporarily
    out of reach, you can effortlessly talk to another repository
    instead.

  - Easier collaboration: you can exchange changesets directly with
    particular colleagues, or between feature branches.

  - Incremental committing: you can commit changes as often as you
    need (this is a big change in your workflow! committing is good!).
    Users of centralised version control systems tend to unconsciously
    resist committing unfinished work, because every commit goes
    straight to the shared server.  DVCSes remove that risk: you can
    commit and modify your changesets at your leisure, only pushing
    them when you're good and ready.
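
The commit-locally, tidy-up, push-later workflow from the last point
can be sketched as a small script.  This is only an illustration under
assumptions: it expects the git command-line tool to be installed, the
file name text001.txt and the demo identity are made up, and the helper
names are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return its standard output."""
    # The -c options set a throwaway identity for this demo only.
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.org", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def demo_local_commits():
    """Commit twice, reword the last commit, and return the commit subjects."""
    if shutil.which("git") is None:
        return None  # git is not installed; nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")  # no server needed to get started
        text = repo / "text001.txt"  # hypothetical corpus file
        text.write_text("the cat sat on th mat\n")
        git(repo, "add", "text001.txt")
        git(repo, "commit", "-m", "Add first text")
        text.write_text("the cat sat on the mat\n")
        git(repo, "commit", "-a", "-m", "wip")
        # Nothing has been pushed yet, so the last commit can still be reworded.
        git(repo, "commit", "--amend", "-m", "Fix typo in text001")
        return git(repo, "log", "--format=%s").splitlines()


if __name__ == "__main__":
    print(demo_local_commits())
```

Everything here happens in a local, temporary repository; a "git push"
would only come later, once the history looks the way you want.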

Now let's go over some specific systems, including a centralised version
control representative, SVN.

Advantages of SVN
-----------------
- Fine grained control of permissions.

  For small teams, a good practice is to have liberal permissions here
  and enforce boundaries through social processes, i.e. we're all adults
  here.  Otherwise, there may be cases where such control is desirable.

- Access to subdirectories of your project.

  For now, the Distributed Version Control Systems (DVCS) manage entire
  projects and not directories.  To my knowledge, there isn't yet as
  good a way to grab pieces of your tree as there is in CVS/SVN.  I think
  of this as a minor/open problem.

Advantages of Git
-----------------
- Popularity: Git is the most popular DVCS out there at the
  moment, so you stand a very good chance of finding guides, helpful
  GUIs, etc.

- Speed: Git is amazingly fast.

- Robustness: Every object in Git is stored under a cryptographic
  hash of its content.  Recovering lost work in Git is generally quite
  easy (if you're willing to sit through some tutorials).

- Flexibility: Git seems to be designed largely from the bottom-up,
  as a sort of filesystem, with people being able to craft their own
  workflows or user interfaces on top of it.
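
To make the robustness point concrete, here is a minimal sketch of how
Git derives the ID of a file's contents (a "blob"), assuming the default
SHA-1 object format.  The function name git_blob_id and the sample
sentences are made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file with these bytes."""
    # Git hashes a short header ("blob", the size, a NUL byte) plus the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    original = b"the cat sat on the mat\n"
    damaged = b"the cat sat on the mat \n"  # one stray space added
    print(git_blob_id(original))
    print(git_blob_id(damaged))  # a different ID: the change cannot go unnoticed
```

Because the ID depends only on the bytes, two identical corpus files
get the same ID, and any silent corruption of a stored file shows up as
a mismatch between its content and its name.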

Recommended reading: Visual Git Reference
                     http://marklodato.github.com/visual-git-guide/

Advantages of Mercurial (Hg)
----------------------------
- Speed and robustness: Hg is about equivalent to Git in terms of
  performance.  Hg does not work quite the same way as Git (it
  stores a "revlog" for each individual file, with incremental
  revisions followed by the occasional snapshot).

- Ease of use: Hg has a much better reputation for its UI than
  Git does.  This may be *particularly* important for corpus-annotation
  work if it involves less computer-nerdy types.  On the other hand,
  if you're going to be using a graphical interface anyway, this may
  not make as huge a difference.
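
The "revlog" idea from the first point can be illustrated with a toy
model.  This is not Mercurial's actual on-disk format: the class name
ToyRevlog, the line-based deltas and the fixed snapshot interval
(snapshot_every) are all simplifications made up for this sketch.

```python
import difflib


class ToyRevlog:
    """History of one file: a full snapshot every few revisions, deltas otherwise."""

    def __init__(self, snapshot_every=4):
        self.snapshot_every = snapshot_every  # hypothetical knob, must be >= 1
        self.entries = []  # each entry is ("snapshot", lines) or ("delta", ops)
        self._latest = []  # lines of the most recent revision

    def add(self, text):
        """Store a new revision and return its revision number."""
        new = text.splitlines(keepends=True)
        if len(self.entries) % self.snapshot_every == 0:
            self.entries.append(("snapshot", new))
        else:
            matcher = difflib.SequenceMatcher(a=self._latest, b=new, autojunk=False)
            ops = []
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag == "equal":
                    ops.append(("copy", i1, i2))  # reuse lines of the previous revision
                else:
                    ops.append(("insert", new[j1:j2]))  # store only the new lines
            self.entries.append(("delta", ops))
        self._latest = new
        return len(self.entries) - 1

    def read(self, rev):
        """Rebuild revision `rev` starting from the nearest earlier snapshot."""
        base = rev - rev % self.snapshot_every
        lines = self.entries[base][1]
        for _kind, ops in self.entries[base + 1 : rev + 1]:
            rebuilt = []
            for op in ops:
                if op[0] == "copy":
                    rebuilt.extend(lines[op[1]:op[2]])
                else:
                    rebuilt.extend(op[1])
            lines = rebuilt
        return "".join(lines)
```

Reading a revision never replays more than snapshot_every - 1 deltas,
which is the trade-off the occasional snapshot buys: small storage for
most revisions, bounded work to reconstruct any one of them.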

Recommended reading:
 Joel Spolsky's Mercurial Tutorial -- http://hginit.com/
 This looks like a very good, easy reference for unlearning the
 centralised way of thinking in general.

Other notes
-----------
I should also add that, as my personal favourite is Darcs, I don't
actually know very much about Git/Hg.

I do *not* recommend Darcs for this particular task (we in the
Darcs Team strongly advise against using Darcs for large projects
as it does not offer the same kind of performance or robustness
as Git/Hg).

Darcs has a lot of flaws, but it offers some unique advantages over its
brethren in the DVCS world, particularly a simple patch-based mental
model and a strong emphasis on "cherry picking" (which is slightly
frowned upon in other systems).  Basically, I tend to think of Darcs
as the next step once we've finally convinced everybody to move from
no version control at all to a centralised system, and from there to
a DVCS.  But it's definitely a work in progress!  If you're curious about
Darcs, I'd suggest trying it out for some smaller personal projects.
You could also ask me about it. :-)

I hope this overview helps.  If the other advice on this list fits
with my mostly tech-focused notes, then I think a good bet would be to
use Mercurial.

Cheers,

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9