[Corpora-List] Using version control software in corpus construction
E.Y.Kow at brighton.ac.uk
Mon Mar 29 13:05:47 UTC 2010
On Sun, Mar 28, 2010 at 16:20:50 +0100, Hardie, Andrew wrote:
> I am contemplating using a source-code version control system (such as
> Subversion) to store the files of a corpus as it is being constructed,
> (a) to help keep track of changes as I go, (b) to allow several people
> to work on it in a non-confusing way and (c) to simplify backing up and
> aid data security.
Disclaimer: my commentary is biased by my work on the Darcs project.
I can't comment on the corpus-construction aspects of this thread, but
I'd like to point out that revision control is easier with a
*distributed* revision control system such as Git/Mercurial/Darcs rather
than a centralised one such as CVS/SVN.
The key point about a DVCS is that it gives you greater liberty to change
your mind. Distributed version control systems (DVCS) offer you:
- Lower static inertia: getting started with a DVCS is super-easy;
no thinking about where to put your repository. Just 'git init'
and you're done.
- Peace of mind: if the central server goes down or is temporarily
out of reach, you can effortlessly talk to another repository
instead.
- Easier collaboration: you can exchange changesets directly with
particular colleagues, or between feature branches.
- Incremental committing: you can commit changes as often as you
need (this is a big change in your workflow! committing is good!).
Users of centralised version control systems tend to unconsciously
resist committing, because every commit goes straight to the shared
repository and work in progress feels too risky to publish.
A DVCS lets you avoid that risk: you can commit and modify your
changesets locally at your leisure, only pushing them when
you're good and ready.
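To make the "lower static inertia" and "incremental committing" points a bit more concrete, here is a minimal sketch in Python that drives the git command line. It assumes git is installed and on your PATH; the file name, the commit messages and the annotator identity are made up for illustration. Nothing in it touches a server: both commits stay local until you decide to push.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def start_corpus_repo(repo: Path) -> int:
    """Create a repository, make two local commits, return the commit count."""
    repo.mkdir(parents=True, exist_ok=True)
    git(repo, "init")
    # Per-repository settings, so the sketch does not rely on global config.
    git(repo, "config", "user.name", "Example Annotator")
    git(repo, "config", "user.email", "annotator@example.org")
    git(repo, "config", "commit.gpgsign", "false")

    # Hypothetical corpus file; any plain-text file works the same way.
    sample = repo / "text001.txt"
    sample.write_text("The cat sat on the mat.\n", encoding="utf-8")
    git(repo, "add", "text001.txt")
    git(repo, "commit", "-m", "Add first corpus text")

    # A small follow-up change, committed straight away and still local.
    sample.write_text("The cat sat on the mat .\n", encoding="utf-8")
    git(repo, "commit", "-am", "Tokenise final punctuation")

    return int(git(repo, "rev-list", "--count", "HEAD").strip())
```

Calling start_corpus_repo(Path("my-corpus")) leaves you with a working repository holding two commits. Publishing them later is a separate, deliberate step (git push).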
Now let's go over some specific systems, including a centralised version
control representative, SVN.
Advantages of SVN
-----------------
- Fine grained control of permissions.
For small teams, a good practice is to have liberal permissions here
and enforce boundaries through social processes, i.e. we're all adults
here. Otherwise, there may be cases where such control is desirable.
- Access to subdirectories of your project.
For now, Distributed Version Control Systems (DVCS) manage entire
projects and not directories. To my knowledge, there isn't yet as
good a way to grab pieces of your tree as there is in CVS/SVN. I think
of this as a minor/open problem.
Advantages of Git
-----------------
- Popularity: Git is the most popular DVCS out there at the
moment, so you stand a very good chance of finding guides, helpful
GUIs, etc.
- Speed: Git is amazingly fast.
- Robustness: All objects in Git are stored under, and identified by, a
cryptographic hash of their content. Recovering lost work in Git is
generally quite easy (if you're willing to sit through some tutorials).
- Flexibility: Git seems to be designed largely from the bottom-up,
as a sort of filesystem, with people being able to craft their own
workflows or user interfaces on top of it.
Recommended reading: Visual Git Reference
http://marklodato.github.com/visual-git-guide/
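
As a small illustration of the robustness point above: the ID Git gives to a file's content (a "blob") is a SHA-1 hash over a short header plus the bytes of the file. The Python sketch below reproduces that calculation. The function name git_blob_id is my own; it covers plain file content only, not trees or commits.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content."""
    # Git hashes "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b"hello world\n"))
# prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

You can check this against "git hash-object" on a file with the same content. Because the ID depends only on the bytes, any corruption or silent edit to a stored corpus file shows up as a hash mismatch.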
Advantages of Mercurial (Hg)
----------------------------
- Speed and robustness: Hg is about equivalent to Git in terms of
performance. Hg does not work quite the same way as Git (it
stores a "revlog" for each individual file, with incremental
revisions followed by the occasional snapshot).
- Ease of use: Hg has a much better reputation for its UI than
Git does. This may be *particularly* important for corpus-annotation
work if it involves less computer-nerdy types. On the other hand,
if you're going to be using a graphical interface anyway, this may
not make as huge a difference.
Recommended reading:
Joel Spolsky's Mercurial Tutorial -- http://hginit.com/
This looks like a very good, easy reference for unlearning the
centralised way of thinking in general.
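
For the curious, the "incremental revisions followed by the occasional snapshot" idea can be sketched as a toy in Python. This is emphatically not Mercurial's actual storage format or its rule for when to take a snapshot; the class ToyRevlog, the max_chain parameter and the line-based deltas are all assumptions made for illustration.

```python
import difflib
from typing import List


def make_delta(old: List[str], new: List[str]) -> list:
    """Describe `new` as copies from `old` plus literal new lines."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        else:
            ops.append(("data", new[j1:j2]))
    return ops


def apply_delta(old: List[str], ops: list) -> List[str]:
    """Rebuild the newer revision from the older one and a delta."""
    out: List[str] = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class ToyRevlog:
    """Per-file history: deltas, with a full snapshot every so often."""

    def __init__(self, max_chain: int = 3) -> None:
        self.max_chain = max_chain   # deltas allowed before a new snapshot
        self.entries: list = []      # ("snapshot", lines) or ("delta", ops)
        self._tip: List[str] = []    # lines of the latest revision
        self._chain = 0              # deltas since the last snapshot

    def add(self, text: str) -> int:
        lines = text.splitlines(keepends=True)
        if not self.entries or self._chain >= self.max_chain:
            self.entries.append(("snapshot", lines))
            self._chain = 0
        else:
            self.entries.append(("delta", make_delta(self._tip, lines)))
            self._chain += 1
        self._tip = lines
        return len(self.entries) - 1

    def get(self, rev: int) -> str:
        # Walk back to the nearest snapshot, then replay deltas forward.
        base = rev
        while self.entries[base][0] != "snapshot":
            base -= 1
        lines = self.entries[base][1]
        for _kind, ops in self.entries[base + 1:rev + 1]:
            lines = apply_delta(lines, ops)
        return "".join(lines)
```

The occasional snapshot is what keeps retrieval cheap: reading any revision replays at most max_chain deltas rather than the file's whole history.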
Other notes
-----------
I should also add that as my personal favourite is Darcs, I don't
actually know very much about Git/Hg.
I do *not* recommend Darcs for this particular task (we in the
Darcs Team strongly advise against using Darcs for large projects
as it does not offer the same kind of performance or robustness
as Git/Hg).
Darcs has a lot of flaws, but it offers some unique advantages over its
brethren in the DVCS world, particularly a simple patch-based mental
model and a strong emphasis on "cherry picking" (which is slightly
frowned upon in other systems). Basically, I tend to think of Darcs
as being the next step once we've finally convinced everybody to
move from no version control at all, to a centralised system, to a
DVCS. But it's definitely a work in progress! If you're curious about
Darcs, I'd suggest trying it out for some smaller personal projects.
You could also ask me about it. :-)
I hope this overview helps. If the other advice on this list fits into
my mostly tech-focused notes, then I think a good bet would be to use
Mercurial.
Cheers,
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9