[Corpora-List] Using version control software in corpus construction

Rich Cooper rich at englishlogickernel.com
Sun Mar 28 16:21:26 UTC 2010


Version control is designed for text files that change during development.
If you put all your markup information into the same file as the actual
text, you would be versioning your markup along with your corpus of
phrases.

It is more functional to put the corpus into a database where you can query
for verbs, nouns, anaphora, and so forth based on the interpretation you
assign to each word or phrase.

Your markup notes can be represented by database columns to distinguish
among instances of the phrase, word, markup or concept throughout the
corpus.  With a database substrate, you can most flexibly adjust your
interpretations as new information arrives. 
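
A minimal sketch of this idea, using Python's built-in sqlite3 module.  The
table name, column names, and sample rows are assumptions made up for
illustration; they are not taken from any particular system.  Each token
gets a row, the markup lives in columns rather than in the text, and an
interpretation can be queried or revised without touching the words:

```python
import sqlite3

# Hypothetical schema: one row per token, with markup held in columns
# instead of being embedded in the text itself.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tokens (
        doc_id   TEXT,     -- which text the token came from
        position INTEGER,  -- token offset within that text
        word     TEXT,     -- the surface form, never modified
        pos      TEXT,     -- interpretation assigned to the word
        note     TEXT      -- free-form markup note
    )
    """
)

# Made-up sample data for illustration only.
rows = [
    ("doc1", 0, "She", "pronoun", "anaphor"),
    ("doc1", 1, "runs", "verb", None),
    ("doc1", 2, "daily", "adverb", None),
    ("doc2", 0, "Runs", "noun", "plural of 'run'"),
]
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)


def words_with_pos(pos):
    """Return (doc_id, position, word) for every token tagged with `pos`."""
    cur = conn.execute(
        "SELECT doc_id, position, word FROM tokens "
        "WHERE pos = ? ORDER BY doc_id, position",
        (pos,),
    )
    return cur.fetchall()


verbs = words_with_pos("verb")

# New information arrives: reinterpret the doc2 token as a verb.
# Only the interpretation columns change; the word itself is untouched.
conn.execute(
    "UPDATE tokens SET pos = ?, note = ? WHERE doc_id = ? AND position = ?",
    ("verb", "reanalysed", "doc2", 0),
)
verbs_after = words_with_pos("verb")
```

Querying for "anaphor" in the note column would work the same way.  Whether
one row per token is the right granularity depends on the corpus; phrase-level
rows are an equally plausible choice.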

Further, when you expect the corpus to grow, incremental changes in your
markup notation, your interpretations of phrases, and other long or short
term theories about the text can best be modeled and synchronized using a
database.  
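
One way such incremental changes might be modeled, sketched with Python's
built-in sqlite3 module: each new interpretation is appended as a revision
rather than overwriting the previous one, so the history stays queryable.
The schema and the helper names (annotate, current_pos, history) are
hypothetical and only illustrate the general approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: interpretations are stored per token and revision.
conn.execute(
    """
    CREATE TABLE annotations (
        token_id INTEGER,
        revision INTEGER,
        pos      TEXT,
        PRIMARY KEY (token_id, revision)
    )
    """
)


def annotate(token_id, pos):
    """Append a new revision for a token instead of overwriting the old one."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(revision), 0) FROM annotations WHERE token_id = ?",
        (token_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO annotations VALUES (?, ?, ?)",
        (token_id, latest + 1, pos),
    )
    return latest + 1


def current_pos(token_id):
    """Return the most recent interpretation, or None if there is none."""
    row = conn.execute(
        "SELECT pos FROM annotations WHERE token_id = ? "
        "ORDER BY revision DESC LIMIT 1",
        (token_id,),
    ).fetchone()
    return row[0] if row else None


def history(token_id):
    """Return every interpretation of a token, oldest first."""
    cur = conn.execute(
        "SELECT pos FROM annotations WHERE token_id = ? ORDER BY revision",
        (token_id,),
    )
    return [r[0] for r in cur]


annotate(1, "noun")   # first analysis
annotate(1, "verb")   # revised analysis; the earlier one is kept
```

Here current_pos(1) gives the latest analysis while history(1) still shows
the earlier one, which covers much of what one would otherwise want from a
version control system for the markup.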

For more information about how a database can manage unstructured text and
the analyses you make of it, see:

http://www.englishlogickernel.com/Patent-7-209-923-B1.PDF
 
and for a specific database of both structured and unstructured texts that
uses this method, see the database used in:

http://www.englishlogickernel.com/Pat20090070317.pdf

The above link shows an example of a database containing both structured
and unstructured columns, organized into an easy-to-process collection of
corpora.

JMHO,
-Rich

Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Hardie, Andrew
Sent: Sunday, March 28, 2010 8:21 AM
To: corpora at uib.no
Subject: [Corpora-List] Using version control software in corpus
construction

Hi all,

I am contemplating using a source-code version control system (such as
Subversion) to store the files of a corpus as it is being constructed,
(a) to help keep track of changes as I go, (b) to allow several people
to work on it in a non-confusing way and (c) to simplify backing up and
aid data security.

Using version control software occurred to me after spending some time
manually keeping track of a set of encoding and markup changes in an
older corpus, and finding it a total pain in the neck. Of course, this
is not exactly what version control software is designed for...

I was wondering, has anyone on the list done this before? If so, are
there any pitfalls to avoid / particular pointers I should be aware of?
Or alternative (better) ways of accomplishing the same thing?

All hints and tips gratefully received.

Best

Andrew.



Andrew Hardie
Department of Linguistics
County South
Lancaster University
Lancaster LA1 4YL
United Kingdom
 
a.hardie at lancaster.ac.uk

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

