Andrew, <br>As an addition to a couple of the responses already (Chris Brew's & Piotr's in particular), <br>I'd like to add that for public distribution and version controlling,<br>SourceForge has been a useful tool for our corpora.<br>
<br>First, SourceForge keeps all the versions of the corpora that we distribute.<br>When we publish on the use of one of our corpora, we include the version number,<br>which is then freely available to the public and preserved,<br>
even if the corpus changes over time.<br><br>Also, all SourceForge projects come with Bug Tracker functionality that can be used<br>by the user community for alerting the corpus creators (us, in this case) <br>of mistakes or inconsistencies in a corpus that need to be fixed, <br>
as well as serving as a record of the corpus changes that we maintain.<br><br>To learn more about this and see it in action, <br>poke around on the <a href="http://bionlp-corpora.sourceforge.net">bionlp-corpora.sourceforge.net</a> site.<br>
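To make the version pinning concrete: a reader of one of our papers can retrieve the exact corpus revision cited there. A minimal sketch, assuming the corpus lives in a Subversion repository (as SourceForge projects did); the repository URL and revision number below are placeholders, not a real corpus:

```python
# Build the Subversion command that retrieves the exact corpus
# revision cited in a publication. URL and revision are hypothetical.
def checkout_command(repo_url, revision, target_dir):
    """Return the argv list for checking out one pinned revision."""
    return ["svn", "checkout", "-r", str(revision), repo_url, target_dir]

cmd = checkout_command(
    "https://example.org/svnroot/mycorpus",  # placeholder repository URL
    142,                                     # revision quoted in the paper (made up)
    "corpus-r142",
)
print(" ".join(cmd))
```

Running the printed command reproduces the corpus state as of that revision, even if the trunk has since moved on.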
<br>-Helen Johnson<br><br><br><br>------------------------------<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<br>
Message: 10<br>
Date: Sun, 28 Mar 2010 23:37:13 -0400<br>
From: chris brew <<a href="mailto:cbrew@acm.org">cbrew@acm.org</a>><br>
Subject: Re: [Corpora-List] Using version control software in corpus<br>
construction<br>
To: Rich Cooper <<a href="mailto:rich@englishlogickernel.com">rich@englishlogickernel.com</a>><br>
Cc: <a href="mailto:CORPORA@uib.no">CORPORA@uib.no</a><br>
<br>
On Sun, Mar 28, 2010 at 12:21 PM, Rich Cooper<br>
<<a href="mailto:rich@englishlogickernel.com">rich@englishlogickernel.com</a>> wrote:<br>
> Version control is for text files that change during development. If you<br>
> put all your markup information into the text file with the actual text, it<br>
> would be encoding your versions of markup as well as your corpora of<br>
> phrases.<br>
<br>
I agree that version control is highly suitable for text files that<br>
change during<br>
a development process. I also agree with the implicit suggestion that keeping<br>
markup and text in the same file is not always the best idea.<br>
<br>
<br>
In addition, publicly accessible repositories, under version control, are a very<br>
good way of ensuring that a wider user community can access the whole<br>
development<br>
history of a corpus. This is valuable, because any corpus that is<br>
worth its salt will<br>
be used for studies at various stages during its development, and we<br>
should want to<br>
retain reproducibility even when a newer version comes along.<br>
<br>
Public repositories are also desirable because it is generally good<br>
for possible imperfections<br>
in the corpus to be exposed to as many people as possible. Corpus<br>
developers, like<br>
software developers, should be keen for their bugs to be fixed by<br>
others. This is a robust<br>
finding for software. Many software developers are now accustomed to<br>
the slightly queasy feeling of<br>
putting stuff out there despite its probable imperfections, and have<br>
found that the benefits of exposure<br>
justify the risk.<br>
<br>
This open-source model is not so attractive if you are constrained by<br>
copyright or by institutional<br>
policy to NOT make the corpus fully available in an open-source form.<br>
In that case<br>
you might still want to use version control, but in a private<br>
repository. And perhaps<br>
to agitate for the copyright release or change in institutional policy<br>
that would allow<br>
you to fully benefit from the help of others.<br>
<br>
I'm neutral on whether the format of the corpus should be defined with<br>
XML schemas, SQL, or something else,<br>
but insistent on the merits of defining it in some way that is<br>
amenable to automated checking, and available<br>
for extension and modification by others. It isn't necessarily crucial<br>
to get everything about the format right<br>
from the outset, but it is crucial to document the format as well as you<br>
are able, and to make clear statements about<br>
what the annotations are supposed to mean. The fact that LDC did this<br>
documentation task well with the Penn Treebank is the reason why<br>
others have been able to use, extend and transform it in all kinds of<br>
interesting ways.<br>
And also to find errors and inconsistencies in it. If we didn't know<br>
what the data was supposed to be like, we'd have no chance of telling<br>
when errors were happening.<br>
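The kind of automated checking argued for above can be as simple as validating every annotation against the documented inventory. A minimal sketch in Python: the tag set and the sample sentence are invented for illustration; a real project would validate against its own published schema or guidelines instead.

```python
import xml.etree.ElementTree as ET

# A documented (here: invented) POS tag inventory; a real project
# would publish its inventory alongside the corpus.
ALLOWED_POS = {"DT", "NN", "VBZ", "JJ", "."}

def check_annotations(xml_text):
    """Return error messages for tokens whose tag is outside the inventory."""
    errors = []
    root = ET.fromstring(xml_text)  # also catches ill-formed XML outright
    for i, tok in enumerate(root.iter("w")):
        pos = tok.get("pos")
        if pos not in ALLOWED_POS:
            errors.append(f"token {i} ({tok.text!r}): unknown tag {pos!r}")
    return errors

sample = '<s><w pos="DT">The</w><w pos="XX">corpus</w><w pos="VBZ">grows</w></s>'
for e in check_annotations(sample):
    print(e)
```

Precisely because the allowed tags are written down, the checker can tell a deliberate annotation from an error — which is the point about the Penn Treebank documentation.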
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 11<br>
Date: Mon, 29 Mar 2010 15:29:10 +0900<br>
From: Takehito UTSURO <<a href="mailto:utsuro@iit.tsukuba.ac.jp">utsuro@iit.tsukuba.ac.jp</a>><br>
Subject: [Corpora-List] CFP: COLING2010 Workshop: NLPIX2010<br>
To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>
<br>
***** Apologies if you receive multiple copies of this CFP. *****<br>
<br>
==================================================================<br>
<br>
CALL FOR PAPERS<br>
COLING2010 Workshop<br>
<br>
The Second International Workshop on NLP Challenges<br>
in the Information Explosion Era (NLPIX 2010)<br>
-- Large-scale and sharable NLP infrastructures and beyond --<br>
<br>
Beijing, China, August 28, 2010<br>
<br>
<br>
Workshop Web Site: <a href="http://nlp.kuee.kyoto-u.ac.jp/NLPIX2010/index.html" target="_blank">http://nlp.kuee.kyoto-u.ac.jp/NLPIX2010/index.html</a><br>
<br>
In Cooperation With: Info-plosion<br>
<a href="http://www.infoplosion.nii.ac.jp/info-plosion/ctr.php/m/IndexEng/a/Index/" target="_blank">http://www.infoplosion.nii.ac.jp/info-plosion/ctr.php/m/IndexEng/a/Index/</a><br>
<br>
<br>
==================================================================<br>
Submission deadline: May 30, 2010<br>
==================================================================<br>
<br>
[Workshop Theme and Topics]<br>
<br>
A long-standing problem in Natural Language Processing has been a lack<br>
of large-scale knowledge for computers. The emergence of the Web and<br>
the rapid increase of information on the Web brought us to what could<br>
be called the "information explosion era," and drastically changed the<br>
environment of NLP. The Web is not only a marvelous target for NLP,<br>
but also a valuable resource from which knowledge could be extracted<br>
for computers. Motivated by the desire to have a very first<br>
opportunity to discuss early approaches to those issues and to share<br>
the state-of-the-art technologies at that time, the first<br>
International Workshop on NLP Challenges in the Information Explosion<br>
Era (NLPIX 2008) was successfully held in conjunction with WWW 2008 in<br>
Beijing. The aim of this second workshop in the NLPIX series<br>
is to bring researchers and practitioners together in<br>
order to discuss large-scale and sharable NLP infrastructures, and<br>
furthermore to discuss emerging NEW issues beyond them. Possible<br>
topics of the paper submissions include, but are not limited to:<br>
<br>
* Construction of large corpora (crawling, preprocessing)<br>
* Sharable large resources (e.g., Google N-gram statistics, etc.)<br>
* Standard for a linguistic annotation framework<br>
* Knowledge acquisition from very large corpora<br>
* Bootstrapping approach for knowledge acquisition<br>
* Large scale text mining based on shallow/deep NLP<br>
* Managing and sharing acquired knowledge<br>
* Exploitation of acquired knowledge for real applications<br>
* Knowledge-based information access, analysis, and organization<br>
* High performance/parallel computing environment for NLP<br>
* Cloud computing for NLP<br>
<br>
In particular, we solicit papers that aim to fulfill a NOVEL<br>
type of need in Web access and that can provide new insight into<br>
future directions of Web access research.<br>
<br>
<br>
[Workshop Schedule/Important Dates]<br>
<br>
* Submission deadline: May 30, 2010<br>
* Notification of acceptance: June 30, 2010<br>
* Workshop date: August 28, 2010<br>
<br>
[Submission Format]<br>
<br>
Paper submissions should follow the COLING 2010 paper submission<br>
policy, including paper format, blind review policy and title and<br>
author format convention<br>
(<a href="http://www.coling-2010.org/SubmissionGuideline.htm" target="_blank">http://www.coling-2010.org/SubmissionGuideline.htm</a>).<br>
Papers should not exceed 10 pages, including references. Middle-sized<br>
papers (e.g., 6-8 pages) are also welcome. Submission is electronic,<br>
using paper submission software. The online submission system will be set<br>
up soon.<br>
<br>
<br>
[Workshop Organizers]<br>
<br>
* Sadao Kurohashi, Kyoto University, Japan<br>
* Takehito Utsuro, University of Tsukuba, Japan<br>
<br>
<br>
[Program Committee]<br>
<br>
* Pushpak Bhattacharyya, IIT, India<br>
* Thorsten Brants, Google, USA<br>
* Eric Villemonte de la Clergerie, INRIA, France<br>
* Atsushi Fujii, Tokyo Institute of Technology, Japan<br>
* Julio Gonzalo, UNED, Spain<br>
* Kentaro Inui, Tohoku University, Japan<br>
* Noriko Kando, NII, Japan<br>
* Daisuke Kawahara, NICT, Japan<br>
* Jun'ichi Kazama, NICT, Japan<br>
* Adam Kilgarriff, Lexical Computing Ltd., UK<br>
* Gary Geunbae Lee, POSTECH, Korea<br>
* Hang Li, Microsoft, China<br>
* Dekang Lin, Google, USA<br>
* Tatsunori Mori, Yokohama National University, Japan<br>
* Satoshi Sekine, New York University, USA<br>
* Kenjiro Taura, University of Tokyo, Japan<br>
* Kentaro Torisawa, NICT, Japan<br>
* Marco Turchi, European Commission - Joint Research Centre, Italy<br>
* Yunqing Xia, The Chinese University of Hong Kong, China<br>
<br>
<br>
[Previous NLPIX Workshop]<br>
<br>
NLP Challenges in the Information Explosion Era (NLPIX 2008),<br>
at WWW2008 in Beijing, China.<br>
<a href="http://www.cl.cs.titech.ac.jp/%7Efujii/NLPIX2008/" target="_blank">http://www.cl.cs.titech.ac.jp/~fujii/NLPIX2008/</a><br>
<br>
<br>
[Contact Us]<br>
<br>
Email: <a href="mailto:nlpix2010@nlp.kuee.kyoto-u.ac.jp">nlpix2010@nlp.kuee.kyoto-u.ac.jp</a><br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 12<br>
Date: Mon, 29 Mar 2010 09:18:02 +0200<br>
From: Serge Heiden <<a href="mailto:slh@ens-lyon.fr">slh@ens-lyon.fr</a>><br>
Subject: Re: [Corpora-List] Using version control software in corpus<br>
construction<br>
To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>
<br>
Chris Brew wrote, on 29/03/2010 at 05:37:<br>
> I also agree with the implicit<br>
> suggestion that keeping markup and text in the same file is not<br>
> always the best idea.<br>
<br>
In our projects, one organizing principle for the corpus architecture<br>
is to separate the parts that change most often from the parts<br>
that don't change much (for example, keeping several layers of tags - from<br>
different taggers and tag sets - apart from the surface text in NLP projects).<br>
For this, we use various XML standoff annotation techniques.<br>
We also use the one-word-per-line technique (aka the IMS CWB source<br>
format) for some parts of our workflows.<br>
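The one-word-per-line technique mentioned here can be illustrated as follows. The tokens and tags are invented, but the shape — one token per line, tab-separated attributes, structural markup on lines of its own — is the IMS CWB "vertical" convention:

```python
def to_vertical(sentences):
    """Render sentences of (word, pos) pairs in one-word-per-line form."""
    lines = []
    for sent in sentences:
        lines.append("<s>")                 # structural tag on its own line
        for word, pos in sent:
            lines.append(f"{word}\t{pos}")  # one token per line, tab-separated
        lines.append("</s>")
    return "\n".join(lines)

print(to_vertical([[("The", "DT"), ("corpus", "NN"), (".", ".")]]))
```

Because each token occupies exactly one line, adding or correcting a tagging layer touches only the affected lines, which keeps version-control diffs small and readable.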
<br>
> it is crucial to document the format as well as you are able,<br>
> and make clear statements about what the annotations are supposed to<br>
> mean.<br>
<br>
We use the guidelines of, and participate in, the Text Encoding Initiative<br>
(TEI) community: <a href="http://www.tei-c.org" target="_blank">http://www.tei-c.org</a>, which has documented corpus sources<br>
for that exact purpose since 1994.<br>
If you feel NLP data is not very well represented in that standard, you<br>
are welcome to propose new encodings and discuss their adoption in the<br>
annual update of the guidelines.<br>
For example, we are in the process of proposing new encodings to<br>
document the full history of the various command-line tools that were<br>
run during the preparation of a corpus (tokenizers and their<br>
parameters, taggers, etc.). We would like our tools to be able to read<br>
that history for their own processing needs.<br>
Documenting is a must, but sharing that documentation between people<br>
and software is also a must.<br>
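One possible shape for such a machine-readable processing history is sketched below. The element and attribute names (`processingHistory`, `appStep`, `param`) are entirely hypothetical illustrations, not the encoding actually being proposed to the TEI:

```python
import xml.etree.ElementTree as ET

def record_step(history, tool, version, params):
    """Append one hypothetical <appStep> element describing a tool run."""
    step = ET.SubElement(history, "appStep", tool=tool, version=version)
    for name, value in params.items():
        ET.SubElement(step, "param", name=name, value=value)
    return step

# Invented wrapper element and tool names, for illustration only.
history = ET.Element("processingHistory")
record_step(history, "my-tokenizer", "1.2", {"lang": "fr"})
record_step(history, "my-tagger", "0.9", {"tagset": "demo"})
print(ET.tostring(history, encoding="unicode"))
```

A downstream tool could then parse this fragment to learn, for instance, which tokenizer and parameters produced the token layer it is about to annotate.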
<br>
--Serge Heiden<br>
<br>
--<br>
Dr. Serge Heiden, <a href="mailto:slh@ens-lyon.fr">slh@ens-lyon.fr</a>, <a href="http://textometrie.ens-lsh.fr" target="_blank">http://textometrie.ens-lsh.fr</a><br>
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française<br>
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883<br>
<br>
<br>
----------------------------------------------------------------------<br>
Send Corpora mailing list submissions to<br>
<a href="mailto:corpora@uib.no">corpora@uib.no</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:corpora-request@uib.no">corpora-request@uib.no</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:corpora-owner@uib.no">corpora-owner@uib.no</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of Corpora digest..."<br>
<br>
<br>
_______________________________________________<br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br>
<br>
End of Corpora Digest, Vol 33, Issue 31<br>
***************************************<br>
</blockquote></div><br>