[Corpora-List] Using version control software in corpus construction
Steven Bird
stevenbird1 at gmail.com
Mon Mar 29 01:48:10 UTC 2010
On 29 March 2010 03:30, Rob Malouf <rmalouf at mail.sdsu.edu> wrote:
> We used version control while building the Alpino corpus/treebank. It works very well as long as your data and annotations are stored in a text-ish format (like XML). Version control doesn't work especially well with binary files -- it'll keep track of the latest versions, but it can't track or merge individual changes.
Note that NLTK stores its corpora in svn, in binary format:
http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
It's not too wasteful, given that we don't actually curate the content
of any corpora. The externally hosted revision control provides a
stable way for people to reference previous distributions. Disk space
is not an issue, since we only include samples in the case of large
corpora (like TIMIT or Europarl).
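
To illustrate the text-versus-binary point from the quoted message, here
is a minimal Python sketch. It is not NLTK code: the sample annotations
and the helper name changed_lines are made up for illustration, and zlib
merely stands in for any compressed binary packaging. It assumes a
line-oriented annotation format.

```python
import difflib
import zlib


def changed_lines(old: str, new: str) -> list:
    """Return only the added/removed lines of a line-based diff (hypothetical helper)."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Two revisions of a tiny, invented annotation file: one tag is corrected.
old = "The/DT\ndog/NN\nbarks/NN\n"
new = "The/DT\ndog/NN\nbarks/VBZ\n"

# In text form, the change is isolated to the one line that was edited.
print(changed_lines(old, new))

# Once compressed, the two revisions are just two different byte strings;
# a line-based diff has no line structure to work with.
packed_old = zlib.compress(old.encode("utf-8"))
packed_new = zlib.compress(new.encode("utf-8"))
print(packed_old != packed_new)
```

With text, the revision history can show which annotation changed; with
the packed form, it can only record that the file as a whole is different,
which is acceptable when the content is not being curated.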
-Steven Bird
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora