[Corpora-List] Using version control software in corpus construction

Steven Bird stevenbird1 at gmail.com
Mon Mar 29 01:48:10 UTC 2010


On 29 March 2010 03:30, Rob Malouf <rmalouf at mail.sdsu.edu> wrote:
> We used version control while building the Alpino corpus/treebank.  It works very well as long as your data and annotations are stored in a text-ish format (like XML).  Version control doesn't work especially well with binary files -- it'll keep track of the latest versions, but it can't track or merge individual changes.

Note that NLTK stores its corpora in svn, in binary format:

    http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

It's not too wasteful, given that we don't actually curate the content
of any corpora.  The externally hosted revision control provides a
stable way for people to reference previous distributions.  Disk space
is not an issue, since we only include samples in the case of large
corpora (like TIMIT or Europarl).
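
To make the text-versus-binary trade-off discussed above concrete, here
is a minimal Python sketch.  The XML fragment and the file names are
made up for illustration, and zlib merely stands in for whatever binary
packaging a project might use; it is not a description of how NLTK or
Alpino actually store their data.

```python
import difflib
import zlib

# Hypothetical annotation file, before and after correcting one POS tag.
old = ['<s id="1">\n', '  <w pos="NN">corpus</w>\n',
       '  <w pos="NN">grows</w>\n', '</s>\n']
new = ['<s id="1">\n', '  <w pos="NN">corpus</w>\n',
       '  <w pos="VBZ">grows</w>\n', '</s>\n']

# Text format: a line-based diff isolates the single changed line,
# which is what lets a version control system track and merge edits.
text_diff = [
    line
    for line in difflib.unified_diff(old, new, "v1.xml", "v2.xml")
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

# Binary format: the same content compressed. The two blobs differ, but
# comparing them byte by byte says nothing about which annotation
# changed, so the repository can only keep each version as a whole.
old_blob = zlib.compress("".join(old).encode("utf-8"))
new_blob = zlib.compress("".join(new).encode("utf-8"))

print("".join(text_diff), end="")
print("blobs identical:", old_blob == new_blob)
```

For content that is redistributed rather than edited, only the second
property matters: each stored revision can still be retrieved and
referenced as a unit.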

-Steven Bird

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


