[Corpora-List] Corpora Digest, Vol 33, Issue 31

Mon Mar 29 17:56:32 UTC 2010

Andrew,
To add to a couple of the earlier responses (Chris Brew's and Piotr's
in particular): SourceForge has been a useful tool for publicly
distributing and version-controlling our corpora.

First, SourceForge keeps all the versions of the corpora that we distribute.
When we publish on the use of one of our corpora, we include the version number.
That version then remains freely available to the public and preserved,
even if the corpus changes over time.

Also, all SourceForge projects come with Bug Tracker functionality that the
user community can use to alert the corpus creators (us, in this case)
to mistakes or inconsistencies in a corpus that need to be fixed.
It also serves as a record of the corpus changes that we maintain.

To learn more about this and see it in action,
poke around on the bionlp-corpora.sourceforge.net site.

-Helen Johnson

------------------------------

>
> Message: 10
> Date: Sun, 28 Mar 2010 23:37:13 -0400
> From: chris brew <cbrew at acm.org>
> Subject: Re: [Corpora-List] Using version control software in corpus
>        construction
> To: Rich Cooper <rich at englishlogickernel.com>
> Cc: CORPORA at uib.no
>
> On Sun, Mar 28, 2010 at 12:21 PM, Rich Cooper
> <rich at englishlogickernel.com> wrote:
> > Version control is for text files that change during development.  If you
> > put all your markup information into the text file with the actual text, it
> > would be encoding your versions of markup as well as your corpora of
> > phrases.
>
> I agree that version control is highly suitable for text files that
> change during a development process. I also agree with the implicit
> suggestion that keeping markup and text in the same file is not always
> the best idea.
>
>
> In addition, publicly accessible repositories, under version control, are a
> very good way of ensuring that a wider user community can access the whole
> development history of a corpus. This is valuable, because any corpus that
> is worth its salt will be used for studies at various stages during its
> development, and we should want to retain reproducibility even when a
> newer version comes along.
>
> Public repositories are also desirable because it is generally good
> for possible imperfections in the corpus to be exposed to as many people
> as possible. Corpus developers, like software developers, should be keen
> for their bugs to be fixed by others. This is a robust finding for
> software. Many software developers are now accustomed to the slightly
> queasy feeling of putting stuff out there despite its probable
> imperfections, and have found that the benefits of exposure justify the
> risk.
>
> This open-source model is not so attractive if you are constrained by
> copyright or by institutional policy to NOT make the corpus fully
> available in an open-source form. In that case you might still want to
> use version control, but in a private repository, and perhaps to agitate
> for the copyright release or change in institutional policy that would
> allow you to fully benefit from the help of others.
>
> I'm neutral on whether the format of the corpus should be defined with
> XML schemas, SQL, or something else, but insistent on the merits of
> defining it in some way that is amenable to automated checking, and
> available for extension and modification by others. It isn't necessarily
> crucial to get everything about the format right from the outset, but it
> is crucial to document the format as well as you are able, and make clear
> statements about what the annotations are supposed to mean. The fact that
> LDC did this documentation task well with the Penn Treebank is the reason
> why others have been able to use, extend and transform it in all kinds of
> interesting ways, and also to find errors and inconsistencies in it. If we
> didn't know what the data was supposed to be like, we'd have no chance of
> telling when errors were happening.
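
As a small illustration of what "amenable to automated checking" could
look like in practice, here is a minimal Python sketch. It is only one
possible approach: the <w> element, its pos attribute, and the three-tag
set are invented for the example and are not taken from the Penn Treebank
or from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag set; a real corpus would take this from its documentation.
ALLOWED_POS = {"DT", "NN", "VBZ"}


def check_corpus(xml_text):
    """Return a list of human-readable problems found in the annotation."""
    problems = []
    root = ET.fromstring(xml_text)
    for i, w in enumerate(root.iter("w"), start=1):
        pos = w.get("pos")
        if pos is None:
            problems.append(f"token {i} ({w.text!r}): missing pos attribute")
        elif pos not in ALLOWED_POS:
            problems.append(f"token {i} ({w.text!r}): unknown tag {pos!r}")
        if not (w.text or "").strip():
            problems.append(f"token {i}: empty token")
    return problems


# Invented sample sentence with one deliberately wrong tag (VBX).
sample = '<s><w pos="DT">The</w><w pos="NN">corpus</w><w pos="VBX">grows</w></s>'
for problem in check_corpus(sample):
    print(problem)
```

A check like this only works because the expected tags are written down
somewhere, which is the point about documentation made above.
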
>
>
>
> ------------------------------
>
> Message: 11
> Date: Mon, 29 Mar 2010 15:29:10 +0900
> From: Takehito UTSURO <utsuro at iit.tsukuba.ac.jp>
> Subject: [Corpora-List] CFP: COLING2010 Workshop: NLPIX2010
> To: corpora at uib.no
>
> ***** Apologies if you receive multiple copies of this CFP. *****
>
> ==================================================================
>
>                      CALL FOR PAPERS
>                     COLING2010 Workshop
>
>    The Second International Workshop on NLP Challenges
>       in the Information Explosion Era (NLPIX 2010)
>  -- Large-scale and sharable NLP infrastructures and beyond --
>
>               Beijing, China, August 28, 2010
>
>
> Workshop Web Site: http://nlp.kuee.kyoto-u.ac.jp/NLPIX2010/index.html
>
> In Cooperation With: Info-plosion
> http://www.infoplosion.nii.ac.jp/info-plosion/ctr.php/m/IndexEng/a/Index/
>
>
> ==================================================================
> Submission deadline: May 30, 2010
> ==================================================================
>
> [Workshop Theme and Topics]
>
> A long-standing problem in Natural Language Processing has been a lack
> of large-scale knowledge for computers. The emergence of the Web and
> the rapid increase of information on the Web brought us to what could
> be called the "information explosion era," and drastically changed the
> environment of NLP. The Web is not only a marvelous target for NLP,
> but also a valuable resource from which knowledge could be extracted
> for computers. Motivated by the desire to have a very first
> opportunity to discuss early approaches to those issues and to share
> the state-of-the-art technologies at that time, the first
> International Workshop on NLP Challenges in the Information Explosion
> Era (NLPIX 2008) was successfully held in conjunction with WWW 2008 in
> Beijing. The aim of the second workshop of the series of International
> Workshop NLPIX is to bring researchers and practitioners together in
> order to discuss large-scale and sharable NLP infrastructures, and
> furthermore to discuss emerging NEW issues beyond them. Possible
> topics of the paper submissions include, but are not limited to:
>
>   * Construction of large corpora (crawling, preprocessing)
>   * Sharable large resources (e.g., Google N-gram statistics, etc.)
>   * Standard for a linguistic annotation framework
>   * Knowledge acquisition from very large corpora
>   * Bootstrapping approach for knowledge acquisition
>   * Large scale text mining based on shallow/deep NLP
>   * Managing and sharing acquired knowledge
>   * Exploitation of acquired knowledge for real applications
>   * Knowledge-based information access, analysis, and organization
>   * High performance/parallel computing environment for NLP
>   * Cloud computing for NLP
>
> In particular, we solicit papers that aim to fulfill a NOVEL
> type of need in Web access and that can provide new insight into
> future directions of Web access research.
>
>
> [Workshop Schedule/Important Dates]
>
>  * Submission deadline: May 30, 2010
>  * Notification of acceptance: June 30, 2010
>  * Workshop date: August 28, 2010
>
> [Submission Format]
>
> Paper submissions should follow the COLING 2010 paper submission
> policy, including paper format, blind review policy and title and
> author format convention
> (http://www.coling-2010.org/SubmissionGuideline.htm).
> Papers should not exceed 10 pages, including references. Middle-sized
> papers (e.g., 6-8 pages) are also welcome.  Submission is electronic,
> using paper submission software. An online submission system will be set
> up soon.
>
>
> [Workshop Organizers]
>
> * Sadao Kurohashi, Kyoto University, Japan
> * Takehito Utsuro, University of Tsukuba, Japan
>
>
> [Program Committee]
>
> * Pushpak Bhattacharyya, IIT, India
> * Thorsten Brants, Google, USA
> * Eric Villemonte de la Clergerie, INRIA, France
> * Atsushi Fujii, Tokyo Institute of Technology, Japan
> * Julio Gonzalo, UNED, Spain
> * Kentaro Inui, Tohoku University, Japan
> * Noriko Kando, NII, Japan
> * Daisuke Kawahara, NICT, Japan
> * Jun'ichi Kazama, NICT, Japan
> * Adam Kilgarriff, Lexical Computing Ltd., UK
> * Gary Geunbae Lee, POSTECH, Korea
> * Hang Li, Microsoft, China
> * Dekang Lin, Google, USA
> * Tatsunori Mori, Yokohama National University, Japan
> * Satoshi Sekine, New York University, USA
> * Kenjiro Taura, University of Tokyo, Japan
> * Kentaro Torisawa, NICT, Japan
> * Marco Turchi, European Commission - Joint Research Centre, Italy
> * Yunqing Xia, The Chinese University of Hong Kong, China
>
>
> [Previous NLPIX Workshop]
>
> NLP Challenges in the Information Explosion Era (NLPIX 2008),
> at WWW2008 in Beijing, China.
> http://www.cl.cs.titech.ac.jp/~fujii/NLPIX2008/
>
>
> [Contact Us]
>
>  Email: nlpix2010 at nlp.kuee.kyoto-u.ac.jp
>
>
>
> ------------------------------
>
> Message: 12
> Date: Mon, 29 Mar 2010 09:18:02 +0200
> From: Serge Heiden <slh at ens-lyon.fr>
> Subject: Re: [Corpora-List] Using version control software in corpus
>        construction
> To: corpora at uib.no
>
> chris brew wrote on 29/03/2010 05:37:
> >  I also agree with the implicit
> >  suggestion that keeping markup and text in the same file is not
> >  always the best idea.
>
> In our projects, one principle for organizing the corpus architecture
> is to separate the parts that change most often from the parts
> that don't change much (for example, several sets of tags, from different
> taggers and tag sets, kept apart from the surface of the texts in NLP
> projects).
> For this, we use various XML standoff annotation techniques.
> We also use the one-word-per-line technique for some parts of
> our workflows (a.k.a. the IMS CWB source format).
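
The separation described above can be sketched roughly as follows in
Python. This is an assumption-laden toy, not the actual workflow: the
token text, both tagger names and both tag sets are invented, and the
merged output only imitates the general one-token-per-line,
tab-separated-columns layout.

```python
# Stable layer: the surface text, one word per line (rarely changes).
tokens_file = "The\ncorpus\ngrows\n"

# Volatile layers: one standoff annotation per tagger, keyed by token index.
# Both tagger names and tag sets are invented for this example.
standoff = {
    "tagger_a": {0: "DT", 1: "NN", 2: "VBZ"},
    "tagger_b": {0: "DET", 1: "NOUN", 2: "VERB"},
}


def verticalize(tokens_text, layers, layer_names):
    """Join tokens with the chosen layers: one token per line, tab-separated."""
    lines = []
    for i, tok in enumerate(tokens_text.splitlines()):
        # "_" marks a token that a given layer has no annotation for.
        columns = [tok] + [layers[name].get(i, "_") for name in layer_names]
        lines.append("\t".join(columns))
    return "\n".join(lines)


print(verticalize(tokens_file, standoff, ["tagger_a", "tagger_b"]))
```

Re-running one tagger then only replaces its own layer; the token file,
and its history under version control, stays untouched.
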
>
> >  it is crucial to document the format as well as you are able,
> >  and make clear statements about what the annotations are supposed to
> >  mean.
>
> We follow the guidelines of, and participate in, the Text Encoding Initiative
> (TEI) community: http://www.tei-c.org, which has documented corpus sources
> for that exact purpose since 1994.
> If you feel NLP data is not very well represented in that standard, you
> are welcome to propose new encodings and discuss their adoption in the
> annual update of the guidelines.
> For example, we are in the process of proposing new encodings to
> document the full history of the various command line tools that were
> called during the preparation of a corpus (tokenizers and their
> parameters, taggers, etc.). We would like our tools to be able to read
> that history for their own processing needs.
> Documenting is a must, but sharing that documentation between people
> and software is a must also.
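
To make the idea of a machine-readable processing history concrete, here
is a rough Python sketch. It is not the TEI encoding being proposed; the
record fields, tool names, versions and arguments are all hypothetical,
and JSON is used only because it is easy to write and read back.

```python
import json


def record_step(history, tool, version, args):
    """Append one processing step to the history and return the history."""
    history.append({
        "step": len(history) + 1,
        "tool": tool,          # hypothetical tool name
        "version": version,    # hypothetical version string
        "args": list(args),    # command line parameters as given
    })
    return history


history = []
record_step(history, "tokenizer", "1.2", ["--lang", "fr"])
record_step(history, "tagger", "3.0", ["--tagset", "example"])

# Serialize so that the history can travel with the corpus ...
serialized = json.dumps(history, indent=2)

# ... and be read back later by another tool.
restored = json.loads(serialized)
print(restored[1]["tool"], restored[1]["args"])
```

Whatever the concrete encoding, the useful property is the round trip:
a later tool can recover which tools ran, in what order, and with which
parameters.
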
>
> --Serge Heiden
>
> --
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lsh.fr
> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
>
>
> ----------------------------------------------------------------------
> Send Corpora mailing list submissions to
>        corpora at uib.no
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://mailman.uib.no/listinfo/corpora
> or, via email, send a message with subject or body 'help' to
>        corpora-request at uib.no
>
> You can reach the person managing the list at
>        corpora-owner at uib.no
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Corpora digest..."
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
> End of Corpora Digest, Vol 33, Issue 31
> ***************************************
>