[Corpora-List] Using version control software in corpus construction

Serge Heiden slh at ens-lyon.fr
Mon Mar 29 07:18:02 UTC 2010


chris brew wrote on 29/03/2010 05:37:
>  I also agree with the implicit
>  suggestion that keeping markup and text in the same file is not
>  always the best idea.

In our projects, one principle guiding the corpus architecture
is to separate the parts that change most often from the parts
that change little (for example, several layers of tags, from different
taggers and tag sets, kept apart from the surface of the texts in NLP projects).
For this, we use various XML standoff annotation techniques.
We also use the one-word-per-line technique (the IMS CWB source
format) for some parts of our workflows.
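
To make the separation concrete, here is a minimal sketch in Python rather
than XML. It is only an illustration under stated assumptions: the sample
text, the layer names (pos_a, pos_b), the tags and the helper to_vertical
are all hypothetical, and the output only approximates the general
one-word-per-line, tab-separated shape, not the exact IMS CWB input rules.

```python
# Hypothetical illustration: the base text is stored once and never modified;
# each tagger's output lives in its own standoff layer.
text = "The cat sleeps"

# (start, end) character offsets into `text`, one pair per token.
tokens = [(0, 3), (4, 7), (8, 14)]

# Two independent annotation layers from (hypothetical) taggers with
# different tag sets; either can be replaced without touching `text`.
layers = {
    "pos_a": ["DT", "NN", "VBZ"],
    "pos_b": ["DET", "NOUN", "VERB"],
}


def to_vertical(text, tokens, layers, order):
    """Merge the base text and the selected layers into one-word-per-line,
    tab-separated rows."""
    lines = []
    for i, (start, end) in enumerate(tokens):
        columns = [text[start:end]] + [layers[name][i] for name in order]
        lines.append("\t".join(columns))
    return "\n".join(lines)


vertical = to_vertical(text, tokens, layers, ["pos_a", "pos_b"])
print(vertical)
```

Because the layers only point into the text, re-running one tagger changes
one layer and leaves the text and the other layers untouched, which is also
what keeps version control diffs small.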

>  it is crucial to document the format as well as you are able,
>  and make clear statements about what the annotations are supposed to
>  mean.

We follow the guidelines of the Text Encoding Initiative (TEI),
http://www.tei-c.org, and participate in its community. The TEI has
documented corpus sources for that exact purpose since 1994.
If you feel NLP data is not very well represented in that standard, you
are welcome to propose new encodings and discuss their adoption in the
annual update of the guidelines.
For example, we are in the process of proposing new encodings to
document the full history of the command-line tools that were
called during the preparation of a corpus (tokenizers and their
parameters, taggers, etc.). We would like our tools to be able to read
that history for their own processing needs.
Documenting is a must, but so is sharing that documentation between
people and software.
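
As a rough sketch of what such a machine-readable history could look like,
here is a small Python example. It is not the TEI encoding we are
proposing: the record fields (step, tool, version, args), the tool names
and the helper record_step are all hypothetical, and JSON is used only
because it is easy to read back.

```python
import json

# Hypothetical processing history: one record per command-line tool run.
history = []


def record_step(history, tool, version, args):
    """Append one processing step (tool name, version, arguments)."""
    entry = {
        "step": len(history) + 1,
        "tool": tool,
        "version": version,
        "args": list(args),
    }
    history.append(entry)
    return entry


# Tool names and parameters below are invented for the example.
record_step(history, "my-tokenizer", "1.0", ["--lang", "fr"])
record_step(history, "my-tagger", "0.3", ["--tagset", "demo"])

# Serialize the history so it can be stored next to the corpus,
# then read it back as a downstream tool would.
serialized = json.dumps(history, indent=2)
restored = json.loads(serialized)
tools_used = [entry["tool"] for entry in restored]
print(tools_used)
```

A downstream tool could then check, for instance, which tokenizer and
parameters produced the token boundaries before deciding how to align its
own annotations.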

--Serge Heiden

-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lsh.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora