[Corpora-List] Using version control software in corpus construction

Mon Mar 29 03:37:13 UTC 2010

On Sun, Mar 28, 2010 at 12:21 PM, Rich Cooper
<rich at englishlogickernel.com> wrote:
> Version control is for text files that change during development.  If you
> put all your markup information into the text file with the actual text, it
> would be encoding your versions of markup as well as your corpora of
> phrases.

I agree that version control is highly suitable for text files that
change during a development process. I also agree with the implicit
suggestion that keeping markup and text in the same file is not always
the best idea.

In addition, publicly accessible repositories, under version control,
are a very good way of ensuring that a wider user community can access
the whole development history of a corpus. This is valuable, because any
corpus that is worth its salt will be used for studies at various stages
during its development, and we should want to retain reproducibility
even when a newer version comes along.

Public repositories are also desirable because it is generally good for
possible imperfections in the corpus to be exposed to as many people as
possible. Corpus developers, like software developers, should be keen
for their bugs to be fixed by others. For software, the benefit of such
exposure is a robust finding. Many software developers are now
accustomed to the slightly queasy feeling of putting stuff out there
despite its probable imperfections, and have found that the benefits of
exposure justify the risk.

This open-source model is not so attractive if you are constrained by
copyright or by institutional policy NOT to make the corpus fully
available in an open-source form. In that case you might still want to
use version control, but in a private repository. You might also want to
agitate for the copyright release or change in institutional policy that
would allow you to fully benefit from the help of others.

I'm neutral on whether the format of the corpus should be defined with
XML schemas, SQL, or something else, but insistent on the merits of
defining it in some way that is amenable to automated checking, and
available for extension and modification by others. It isn't necessarily
crucial to get everything about the format right from the outset, but it
is crucial to document the format as well as you are able, and to make
clear statements about what the annotations are supposed to mean. The
fact that LDC did this documentation task well with the Penn Treebank is
the reason why others have been able to use, extend and transform it in
all kinds of interesting ways, and also to find errors and
inconsistencies in it. If we didn't know what the data was supposed to
be like, we'd have no chance of telling when errors were happening.
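
To make "amenable to automated checking" concrete, here is a minimal
sketch in Python. Everything specific in it is an assumption made for
illustration, not a description of any real corpus: the format (one
token and one tag per line, separated by a tab, with blank lines between
sentences), the tiny tagset in ALLOWED_TAGS, and the function name
check_corpus. The point is only that once the format is written down, a
few lines of code can tell you when the data departs from it.

```python
from typing import List

# Hypothetical tagset for illustration; a real corpus would document its own.
ALLOWED_TAGS = {"DT", "NN", "VBZ", "JJ", "."}


def check_corpus(lines: List[str]) -> List[str]:
    """Return human-readable error messages; an empty list means no problems found."""
    errors = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            # In this assumed format a blank line marks a sentence boundary.
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            errors.append(
                f"line {lineno}: expected 2 tab-separated fields, got {len(fields)}"
            )
            continue
        token, tag = fields
        if not token:
            errors.append(f"line {lineno}: empty token")
        if tag not in ALLOWED_TAGS:
            errors.append(f"line {lineno}: unknown tag {tag!r}")
    return errors


# A small made-up sample with two deliberate errors at the end.
sample = ["The\tDT", "corpus\tNN", "grows\tVBZ", ".\t.", "", "oops\tXX", "broken line"]
for message in check_corpus(sample):
    print(message)
```

A check like this could be run before every commit to the repository, so
that format violations are caught when they are introduced rather than
discovered by a user several versions later.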

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora