[Corpora-List] Using version control software in corpus construction

Jin-Dong Kim jdkim at is.s.u-tokyo.ac.jp
Mon Mar 29 01:44:00 UTC 2010


We also used CVS for the development of the GENIA corpus and its annotations.

To support manual annotation, we developed an XML-based
annotation system called XConc. Because XConc was developed as
an Eclipse plugin, it integrated well with the CVS client in
Eclipse.

One thing I think we should be aware of in order to use CVS efficiently
is that its version control functionality relies on 'diff'. As you may
already know, 'diff' compares text files line by line. In
other words, it tells you which lines of a text file have changed.
So the worst way of using CVS for corpus development would be to keep a
whole document on one line, which sometimes happens when we use XML.
In that case, CVS will only tell you whether the document has changed
or not, without any detail about where the changes are.
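
To illustrate the point about line-by-line comparison, here is a minimal
sketch in Python. It assumes the standard difflib module as a stand-in
for the 'diff' that CVS relies on. The function name changed_lines and
the sample sentence are made up for this example and do not come from
GENIA or XConc.

```python
import difflib
from itertools import islice


def changed_lines(old_text, new_text):
    """Return the removed ('-') and added ('+') lines of a line-based diff."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    )
    # The first two lines are the '---' / '+++' file headers; skip them.
    body = islice(diff, 2, None)
    # Drop the '@@' hunk markers and keep only removed/added lines.
    return [line for line in body if line.startswith(("+", "-"))]


# Hypothetical sample: the same sentence before and after one word is edited.
old_tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
new_tokens = ["The", "cat", "sat", "on", "a", "mat", "."]

# Layout 1: the whole document on a single line.
one_line_old = "<s>" + " ".join(old_tokens) + "</s>\n"
one_line_new = "<s>" + " ".join(new_tokens) + "</s>\n"

# Layout 2: one word per line.
per_word_old = "\n".join(old_tokens) + "\n"
per_word_new = "\n".join(new_tokens) + "\n"

print(changed_lines(one_line_old, one_line_new))
print(changed_lines(per_word_old, per_word_new))
```

With the single-line layout, the diff reports the entire document as
removed and re-added. With one word per line, it reports only the word
that was edited.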

If your annotation is at the word level, I would recommend keeping only
one word per line, so that you can track word-by-word changes. If it
is at the sentence level, then keeping one sentence per line would be
reasonable.
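
Such a layout can be produced mechanically before the files are
committed. The following Python sketch is only an illustration: it
assumes that tokens are separated by whitespace, which will not hold for
every corpus, and the function names are hypothetical.

```python
def to_one_token_per_line(sentence):
    """Put each whitespace-separated token of a sentence on its own line."""
    # Assumption: whitespace marks token boundaries in the input.
    return "\n".join(sentence.split()) + "\n"


def to_one_sentence_per_line(sentences):
    """Put each sentence on its own line, collapsing internal whitespace."""
    # Collapsing whitespace removes stray line breaks inside a sentence,
    # so that one line always corresponds to exactly one sentence.
    return "".join(" ".join(s.split()) + "\n" for s in sentences)


print(to_one_token_per_line("The cat sat on the mat ."), end="")
print(to_one_sentence_per_line(["The cat sat .", "It  slept\nthere ."]), end="")
```

Either function gives a line-based 'diff' a unit that matches the unit
of annotation.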

Best,

Jin-Dong

-----
Jin-Dong Kim, Ph.D,
Project Lecturer,
University of Tokyo



On Mon, Mar 29, 2010 at 6:00 AM, Serge Heiden <slh at ens-lyon.fr> wrote:
> Andrew,
>
> Some French projects use version control for their corpora source files
> for the reasons you mentioned.
> Several use version control through the Eclipse SVN plugin integrated in
> the Millefeuille XML editing platform:
> http://ralyx.inria.fr/2008/Raweb/aviz/uid45.html
> Others use the SVN client integrated in the Oxygen XML editor:
> http://www.oxygenxml.com/doc/ug-oxygen/svn-client.html
> Concerning the use of version control with XML, I recall an old (2003)
> thread on the TEI-L mailing list about XML diff software that could be helpful:
> http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind0305&L=TEI-L&P=R1880
> Oxygen has good XML diff support now:
> http://www.oxygenxml.com/doc/ug-oxygen/file-comparison.html
> If you plan to use Subversion, Syd Bauman has written an XSLT stylesheet
> that could be helpful:
> http://wiki.tei-c.org/index.php/Extract-svn-id.xslt
> The eXist XML database could help as a backend for versioning:
> http://exist.sourceforge.net/versioning.html
> But I haven't used it myself.
>
> Best,
> Serge
>
> Hardie, Andrew wrote on 28/03/2010 17:20:
>>
>> Hi all,
>>
>> I am contemplating using a source-code version control system (such as
>> Subversion) to store the files of a corpus as it is being constructed,
>> (a) to help keep track of changes as I go, (b) to allow several people
>> to work on it in a non-confusing way and (c) to simplify backing up and
>> aid data security.
>>
>> Using version control software occurred to me after spending some time
>> manually keeping track of a set of encoding and markup changes in an
>> older corpus, and finding it a total pain in the neck. Of course, this
>> is not exactly what version control software is designed for...
>>
>> I was wondering, has anyone on the list done this before? If so, are
>> there any pitfalls to avoid / particular pointers I should be aware of?
>> Or alternative (better) ways of accomplishing the same thing?
>>
>> All hints and tips gratefully received.
>>
>> Best
>>
>> Andrew.
>>
>>
>>
>> Andrew Hardie
>> Department of Linguistics
>> County South
>> Lancaster University
>> Lancaster LA1 4YL
>> United Kingdom
>>
>> a.hardie at lancaster.ac.uk
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>
> --
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lsh.fr
> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
