Call: Final CFP and deadline extension: Challenges in the management of large corpora (@LREC-2012)

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Wed Feb 15 16:25:06 UTC 2012

Date: Wed, 15 Feb 2012 12:14:55 +0100
From: Serge Heiden <slh at>
Message-ID: <4F3B93AF.90201 at>

[Apologies for multiple postings]

Final call for Papers: LREC 2012 Workshop


Please note: the submission deadline has been EXTENDED to February 29.

We live in an age where the well-known maxim that "the only thing better
than data is more data" no longer sets unattainable goals. Creating
extremely large corpora is no longer a challenge, given the proven
methods behind, e.g., the Web-as-Corpus approach or the use of Google's
n-gram collection. Indeed, the challenge has now shifted towards dealing
with large amounts of primary data and much larger amounts of annotation
data. On the one hand, this challenge concerns finding new (corpus-)
linguistic methodologies that can make use of such /extremely large
corpora/, e.g. in order to investigate rare phenomena involving multiple
lexical items or to find and represent fine-grained sub-regularities; on
the other hand, some fundamental technical methods and strategies are
being called into question. These include, e.g., successful curation of
the data, management of collections that span multiple volumes or that
are distributed across several centres, methods to clean the data of
non-linguistic intrusions or duplicates, as well as automatic annotation
methods or innovative corpus architectures that maximise the usefulness
of the data or allow it to be searched and analysed efficiently. Among
the new tasks are also collaborative manual annotation and methods to
manage it, as well as new challenges for the statistical analysis of
such data and metadata.

The half-day workshop on "Challenges in the management of large corpora"
aims to bring together leading researchers in the fields of Language
Resource creation and Corpus Linguistics for an intensive exchange of
expertise, results and ideas.

We invite submissions dealing with:

*  building tools for all aspects of management of very large corpora,
*  dealing with large data sets (file system architecture, database
*  dealing with heavily annotated corpora,
*  managing multiple and concurrent annotation layers,
*  use of annotation standards for large data sets,
*  issues of interoperability and tool-chaining,
*  crowdsourcing for large data sets,
*  quality control of annotations in large data sets,
*  analytic tools used in research infrastructure initiatives, such
   as the Common Language Resource and Technology Infrastructure
   (CLARIN),
*  dealing with corpora physically distributed over different locations,
*  managing metadata for extremely large corpus collections,
*  efficient user interfaces,
*  effective querying of large corpora with multiple annotation layers,
*  "bringing the code to the data" as the strategy for dealing with
   IPR restrictions,
*  open-source software and open-data corpora strategies,
*  other issues that arise in the context of management of large
   corpora.

Current information is available at:

Abstract submission

We invite extended abstracts (1500 to 2000 words) for 20+10 minute
presentations, as well as posters and demos. All abstracts have to be
submitted via the START Conference Manager, available from

Please note: when submitting a contribution to START, authors will
be asked to provide essential information about resources (in a broad
sense, i.e. also technologies, standards, evaluation kits, etc.) that
have been used for the work described in the contribution or are a new
result of their research. For further information on this new
initiative, please refer to

Important dates (please note the changes!)

Workshop: 22 May 2012, afternoon session.

Deadline for submission of extended abstracts: February 29.

Notification of acceptance: March 8.

Submission of full, camera-ready papers: March 23.


The workshop will take place at the Conference venue, the Lütfi Kirdar
Istanbul Exhibition and Congress Centre. Further details will be
available in due time from the conference homepage.

Organizing Committee

The workshop is co-organized by the following three institutions:

* Institut für Deutsche Sprache, Mannheim *

        Piotr Bański, Marc Kupietz, Andreas Witt

* Institute for Language Information and Technology, Eastern Michigan
University *

        Helen Aristar-Dry, Anthony Aristar, Damir Ćavar

* ICAR Laboratory, Lyon University *

        Serge Heiden

Programme Committee

Núria Bel (Universitat Pompeu Fabra)
Mark Davies (Brigham Young University)
Stefanie Dipper (Ruhr-Universität Bochum)
Tomaž Erjavec (Jožef Stefan Institute)
Stefan Evert (Technische Universität Darmstadt)
Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
Andrew Hardie (University of Lancaster)
Nancy Ide (Vassar College)
Sandra Kübler (Indiana University)
Martin Mueller (Northwestern University)
Mark Olsen (University of Chicago)
Adam Przepiórkowski (Polish Academy of Sciences, University of Warsaw)
Reinhard Rapp (Johannes Gutenberg-Universität Mainz, University of Leeds)
Laurent Romary (INRIA, Humboldt-Universität zu Berlin)
Serge Sharoff (University of Leeds)
Pavel Straňák (Charles University in Prague)
Amir Zeldes (Humboldt-Universität zu Berlin)

=> Workshop homepage:

Dr. Serge Heiden, slh at,
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
