Andrew, <br>As an addition to a couple of the responses already (Chris Brew's & Piotr's in particular), <br>I'd like to add that for public distribution and version controlling,<br>SourceForge has been a useful tool for our corpora.<br>
<br>First, SourceForge keeps all the versions of the corpora that we distribute.<br>When we publish on the use of one of our corpora, we include the version number,<br>which is then freely available to the public and preserved,<br>
even if the corpus changes over time.<br><br>Also, all SourceForge projects come with Bug Tracker functionality that can be used<br>by the user community for alerting the corpus creators (us, in this case) <br>of mistakes or inconsistencies in a corpus that need to be fixed, <br>
as well as serving as a record of the corpus changes that we maintain.<br><br>To learn more about this and see it in action, <br>poke around on the <a href="http://bionlp-corpora.sourceforge.net">bionlp-corpora.sourceforge.net</a> site.<br>
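To make the version pinning concrete: a reader of one of our papers can retrieve the exact corpus revision cited there. A minimal sketch, assuming the corpus lives in a Subversion repository (as SourceForge projects did); the repository URL and revision number below are placeholders, not a real corpus:

```python
# Build the Subversion command that retrieves the exact corpus
# revision cited in a publication. URL and revision are hypothetical.
def checkout_command(repo_url, revision, target_dir):
    """Return the argv list for checking out one pinned revision."""
    return ["svn", "checkout", "-r", str(revision), repo_url, target_dir]

cmd = checkout_command(
    "https://example.org/svnroot/mycorpus",  # placeholder repository URL
    142,                                     # revision quoted in the paper (made up)
    "corpus-r142",
)
print(" ".join(cmd))
```

Running the printed command reproduces the corpus state as of that revision, even if the trunk has since moved on.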
<br>-Helen Johnson<br><br><br><br>------------------------------<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<br>
Message: 10<br>
Date: Sun, 28 Mar 2010 23:37:13 -0400<br>
From: chris brew <<a href="mailto:cbrew@acm.org">cbrew@acm.org</a>><br>
Subject: Re: [Corpora-List] Using version control software in corpus<br>
construction<br>
To: Rich Cooper <<a href="mailto:rich@englishlogickernel.com">rich@englishlogickernel.com</a>><br>
Cc: <a href="mailto:CORPORA@uib.no">CORPORA@uib.no</a><br>
<br>
On Sun, Mar 28, 2010 at 12:21 PM, Rich Cooper<br>
<<a href="mailto:rich@englishlogickernel.com">rich@englishlogickernel.com</a>> wrote:<br>
> Version control is for text files that change during development. If you<br>
> put all your markup information into the text file with the actual text, it<br>
> would be encoding your versions of markup as well as your corpora of<br>
> phrases.<br>
<br>
I agree that version control is highly suitable for text files that<br>
change during<br>
a development process. I also agree with the implicit suggestion that keeping<br>
markup and text in the same file is not always the best idea.<br>
<br>
<br>
In addition, publicly accessible repositories, under version control, are a very<br>
good way of ensuring that a wider user community can access the whole<br>
development<br>
history of a corpus. This is valuable, because any corpus that is<br>
worth its salt will<br>
be used for studies at various stages during its development, and we<br>
should want to<br>
retain reproducibility even when a newer version comes along.<br>
<br>
Public repositories are also desirable because it is generally good<br>
for possible imperfections<br>
in the corpus to be exposed to as many people as possible. Corpus<br>
developers, like<br>
software developers, should be keen for their bugs to be fixed by<br>
others. This is a robust<br>
finding for software. Many software developers are now accustomed to<br>
the slightly queasy feeling of<br>
putting stuff out there despite its probable imperfections, and have<br>
found that the benefits of exposure<br>
justify the risk.<br>
<br>
This open-source model is not so attractive if you are constrained by<br>
copyright or by institutional<br>
policy to NOT make the corpus fully available in an open-source form.<br>
In that case<br>
you might still want to use version control, but in a private<br>
repository. And perhaps<br>
to agitate for the copyright release or change in institutional policy<br>
that would allow<br>
you to fully benefit from the help of others.<br>
<br>
I'm neutral on whether the format of the corpus should be defined with<br>
XML schemas, SQL, or something else,<br>
but insistent on the merits of defining it in some way that is<br>
amenable to automated checking, and available<br>
for extension and modification by others. It isn't necessarily crucial<br>
to get everything about the format right<br>
from the outset, but it is crucial to document the format as well as you<br>
are able, and to make clear statements about<br>
what the annotations are supposed to mean. The fact that LDC did this<br>
documentation task well with the Penn Treebank is the reason why<br>
others have been able to use, extend and transform it in all kinds of<br>
interesting ways.<br>
And also to find errors and inconsistencies in it. If we didn't know<br>
what the data was supposed to be like, we'd have no chance of telling<br>
when errors were happening.<br>
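The kind of automated checking argued for above can be as simple as validating every annotation against the documented inventory. A minimal sketch in Python: the tag set and the sample sentence are invented for illustration; a real project would validate against its own published schema or guidelines instead.

```python
import xml.etree.ElementTree as ET

# A documented (here: invented) POS tag inventory; a real project
# would publish its inventory alongside the corpus.
ALLOWED_POS = {"DT", "NN", "VBZ", "JJ", "."}

def check_annotations(xml_text):
    """Return error messages for tokens whose tag is outside the inventory."""
    errors = []
    root = ET.fromstring(xml_text)  # also catches ill-formed XML outright
    for i, tok in enumerate(root.iter("w")):
        pos = tok.get("pos")
        if pos not in ALLOWED_POS:
            errors.append(f"token {i} ({tok.text!r}): unknown tag {pos!r}")
    return errors

sample = '<s><w pos="DT">The</w><w pos="XX">corpus</w><w pos="VBZ">grows</w></s>'
for e in check_annotations(sample):
    print(e)
```

Precisely because the allowed tags are written down, the checker can tell a deliberate annotation from an error — which is the point about the Penn Treebank documentation.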
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 11<br>
Date: Mon, 29 Mar 2010 15:29:10 +0900<br>
From: Takehito UTSURO <<a href="mailto:utsuro@iit.tsukuba.ac.jp">utsuro@iit.tsukuba.ac.jp</a>><br>
Subject: [Corpora-List] CFP: COLING2010 Workshop: NLPIX2010<br>
To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>
<br>
***** Apologies if you receive multiple copies of this CFP. *****<br>
<br>
==================================================================<br>
<br>
CALL FOR PAPERS<br>
COLING2010 Workshop<br>
<br>
The Second International Workshop on NLP Challenges<br>
in the Information Explosion Era (NLPIX 2010)<br>
-- Large-scale and sharable NLP infrastructures and beyond --<br>
<br>
Beijing, China, August 28, 2010<br>
<br>
<br>
Workshop Web Site: <a href="http://nlp.kuee.kyoto-u.ac.jp/NLPIX2010/index.html" target="_blank">http://nlp.kuee.kyoto-u.ac.jp/NLPIX2010/index.html</a><br>
<br>
In Cooperation With: Info-plosion<br>
<a href="http://www.infoplosion.nii.ac.jp/info-plosion/ctr.php/m/IndexEng/a/Index/" target="_blank">http://www.infoplosion.nii.ac.jp/info-plosion/ctr.php/m/IndexEng/a/Index/</a><br>
<br>
<br>
==================================================================<br>
Submission deadline: May 30, 2010<br>
==================================================================<br>
<br>
[Workshop Theme and Topics]<br>
<br>
A long-standing problem in Natural Language Processing has been a lack<br>
of large-scale knowledge for computers. The emergence of the Web and<br>
the rapid increase of information on the Web brought us to what could<br>
be called the "information explosion era," and drastically changed the<br>
environment of NLP. The Web is not only a marvelous target for NLP,<br>
but also a valuable resource from which knowledge could be extracted<br>
for computers. Motivated by the desire to have a very first<br>
opportunity to discuss early approaches to those issues and to share<br>
the state-of-the-art technologies at that time, the first<br>
International Workshop on NLP Challenges in the Information Explosion<br>
Era (NLPIX 2008) was successfully held in conjunction with WWW 2008 in<br>
Beijing. The aim of this second workshop in the NLPIX series<br>
is to bring researchers and practitioners together in<br>
order to discuss large-scale and sharable NLP infrastructures, and<br>
furthermore to discuss emerging NEW issues beyond them. Possible<br>
topics of the paper submissions include, but are not limited to:<br>
<br>
* Construction of large corpora (crawling, preprocessing)<br>
* Sharable large resources (e.g., Google N-gram statistics, etc.)<br>
* Standard for a linguistic annotation framework<br>
* Knowledge acquisition from very large corpora<br>
* Bootstrapping approach for knowledge acquisition<br>
* Large scale text mining based on shallow/deep NLP<br>
* Managing and sharing acquired knowledge<br>
* Exploitation of acquired knowledge for real applications<br>
* Knowledge-based information access, analysis, and organization<br>
* High performance/parallel computing environment for NLP<br>
* Cloud computing for NLP<br>
<br>
In particular, we solicit papers that aim to fulfill a NOVEL<br>
type of need in Web access and that can provide new insight into<br>
future directions of Web access research.<br>
<br>
<br>
[Workshop Schedule/Important Dates]<br>
<br>
* Submission deadline: May 30, 2010<br>
* Notification of acceptance: June 30, 2010<br>
* Workshop date: August 28, 2010<br>
<br>
[Submission Format]<br>
<br>
Paper submissions should follow the COLING 2010 paper submission<br>
policy, including paper format, blind review policy and title and<br>
author format convention<br>
(<a href="http://www.coling-2010.org/SubmissionGuideline.htm" target="_blank">http://www.coling-2010.org/SubmissionGuideline.htm</a>).<br>
Papers should not exceed 10 pages, including references. Middle-sized<br>
papers (e.g., 6-8 pages) are also welcome. Submission is electronic,<br>
using paper submission software. The online submission system will be set<br>
up soon.<br>
<br>
<br>
[Workshop Organizers]<br>
<br>
* Sadao Kurohashi, Kyoto University, Japan<br>
* Takehito Utsuro, University of Tsukuba, Japan<br>
<br>
<br>
[Program Committee]<br>
<br>
* Pushpak Bhattacharyya, IIT, India<br>
* Thorsten Brants, Google, USA<br>
* Eric Villemonte de la Clergerie, INRIA, France<br>
* Atsushi Fujii, Tokyo Institute of Technology, Japan<br>
* Julio Gonzalo, UNED, Spain<br>
* Kentaro Inui, Tohoku University, Japan<br>
* Noriko Kando, NII, Japan<br>
* Daisuke Kawahara, NICT, Japan<br>
* Jun'ichi Kazama, NICT, Japan<br>
* Adam Kilgarriff, Lexical Computing Ltd., UK<br>
* Gary Geunbae Lee, POSTECH, Korea<br>
* Hang Li, Microsoft, China<br>
* Dekang Lin, Google, USA<br>
* Tatsunori Mori, Yokohama National University, Japan<br>
* Satoshi Sekine, New York University, USA<br>
* Kenjiro Taura, University of Tokyo, Japan<br>
* Kentaro Torisawa, NICT, Japan<br>
* Marco Turchi, European Commission - Joint Research Centre, Italy<br>
* Yunqing Xia, The Chinese University of Hong Kong, China<br>
<br>
<br>
[Previous NLPIX Workshop]<br>
<br>
NLP Challenges in the Information Explosion Era (NLPIX 2008),<br>
at WWW2008 in Beijing, China.<br>
<a href="http://www.cl.cs.titech.ac.jp/%7Efujii/NLPIX2008/" target="_blank">http://www.cl.cs.titech.ac.jp/~fujii/NLPIX2008/</a><br>
<br>
<br>
[Contact Us]<br>
<br>
Email: <a href="mailto:nlpix2010@nlp.kuee.kyoto-u.ac.jp">nlpix2010@nlp.kuee.kyoto-u.ac.jp</a><br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 12<br>
Date: Mon, 29 Mar 2010 09:18:02 +0200<br>
From: Serge Heiden <<a href="mailto:slh@ens-lyon.fr">slh@ens-lyon.fr</a>><br>
Subject: Re: [Corpora-List] Using version control software in corpus<br>
construction<br>
To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>
<br>
Chris Brew wrote, on 29/03/2010 at 05:37:<br>
> I also agree with the implicit<br>
> suggestion that keeping markup and text in the same file is not<br>
> always the best idea.<br>
<br>
In our projects, one organizing principle for the corpus architecture<br>
is to separate the parts that change most often from the parts<br>
that don't change much (for example, keeping several layers of tags - from<br>
different taggers and tag sets - apart from the surface text in NLP projects).<br>
For this, we use various XML standoff annotation techniques.<br>
We also use the one-word-per-line technique (aka the IMS CWB source<br>
format) for some parts of our workflows.<br>
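The one-word-per-line technique mentioned here can be illustrated as follows. The tokens and tags are invented, but the shape — one token per line, tab-separated attributes, structural markup on lines of its own — is the IMS CWB "vertical" convention:

```python
def to_vertical(sentences):
    """Render sentences of (word, pos) pairs in one-word-per-line form."""
    lines = []
    for sent in sentences:
        lines.append("<s>")                 # structural tag on its own line
        for word, pos in sent:
            lines.append(f"{word}\t{pos}")  # one token per line, tab-separated
        lines.append("</s>")
    return "\n".join(lines)

print(to_vertical([[("The", "DT"), ("corpus", "NN"), (".", ".")]]))
```

Because each token occupies exactly one line, adding or correcting a tagging layer touches only the affected lines, which keeps version-control diffs small and readable.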
<br>
> it is crucial to document the format as well as you are able,<br>
> and make clear statements about what the annotations are supposed to<br>
> mean.<br>
<br>
We use the guidelines of, and participate in, the Text Encoding Initiative<br>
(TEI) community: <a href="http://www.tei-c.org" target="_blank">http://www.tei-c.org</a>, which has documented corpus sources<br>
for that exact purpose since 1994.<br>
If you feel NLP data is not very well represented in that standard, you<br>
are welcome to propose new encodings and discuss their adoption in the<br>
annual update of the guidelines.<br>
For example, we are in the process of proposing new encodings to<br>
document the full history of the various command-line tools that were<br>
run during the preparation of a corpus (tokenizers and their<br>
parameters, taggers, etc.). We would like our tools to be able to read<br>
that history for their own processing needs.<br>
Documenting is a must, but sharing that documentation between people<br>
and software is also a must.<br>
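One possible shape for such a machine-readable processing history is sketched below. The element and attribute names (`processingHistory`, `appStep`, `param`) are entirely hypothetical illustrations, not the encoding actually being proposed to the TEI:

```python
import xml.etree.ElementTree as ET

def record_step(history, tool, version, params):
    """Append one hypothetical <appStep> element describing a tool run."""
    step = ET.SubElement(history, "appStep", tool=tool, version=version)
    for name, value in params.items():
        ET.SubElement(step, "param", name=name, value=value)
    return step

# Invented wrapper element and tool names, for illustration only.
history = ET.Element("processingHistory")
record_step(history, "my-tokenizer", "1.2", {"lang": "fr"})
record_step(history, "my-tagger", "0.9", {"tagset": "demo"})
print(ET.tostring(history, encoding="unicode"))
```

A downstream tool could then parse this fragment to learn, for instance, which tokenizer and parameters produced the token layer it is about to annotate.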
<br>
--Serge Heiden<br>
<br>
--<br>
Dr. Serge Heiden, <a href="mailto:slh@ens-lyon.fr">slh@ens-lyon.fr</a>, <a href="http://textometrie.ens-lsh.fr" target="_blank">http://textometrie.ens-lsh.fr</a><br>
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française<br>
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883<br>
<br>
<br>
----------------------------------------------------------------------<br>
Send Corpora mailing list submissions to<br>
<a href="mailto:corpora@uib.no">corpora@uib.no</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:corpora-request@uib.no">corpora-request@uib.no</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:corpora-owner@uib.no">corpora-owner@uib.no</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of Corpora digest..."<br>
<br>
<br>
_______________________________________________<br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br>
<br>
End of Corpora Digest, Vol 33, Issue 31<br>
***************************************<br>
</blockquote></div><br>