[Corpora-List] DEADLINE EXTENSION: Challenges in the management of large corpora (CMLC-2) - LREC 2014 satellite workshop

Mon Feb 10 09:39:24 UTC 2014

                         CfP / DEADLINE EXTENSION

         Challenges in the management of large corpora (CMLC-2)
                http://corpora.ids-mannheim.de/cmlc.html

                 Workshop in conjunction with LREC 2014
                           26-31 May, Reykjavik

             Workshop date: Saturday, 31 May 2014 (afternoon)

           *** NEW Submission deadline: Thursday, 20 February 2014 ***

Workshop description

We live in an age where the well-known maxim that "the only thing better
than data is more data" is something that no longer sets unattainable
goals. Creating extremely large corpora is no longer a challenge, given
the proven methods that lie behind e.g. applying the Web-as-Corpus
approach or utilizing Google's n-gram collection. Indeed, the challenge
has now shifted towards dealing with large amounts of primary data and
much larger amounts of annotation data. On the one hand, this challenge
concerns finding new (corpus-)linguistic methodologies that can make use
of such extremely large corpora e.g. in order to investigate rare
phenomena involving multiple lexical items, to find and represent
fine-grained sub-regularities, or to investigate variations within and
across language domains; on the other hand, some fundamental technical
methods and strategies are being called into question. These include
e.g. successful curation of data, management of collections that span
multiple volumes or that are distributed across several centres, methods
to clean the data of non-linguistic intrusions or duplicates, as well
as automatic annotation methods or innovative corpus architectures that
maximise the usefulness of data or allow it to be searched and analyzed
efficiently. Among the new tasks are also collaborative manual
annotation and methods to manage it as well as new challenges to the
statistical analysis of such data and metadata.

Motivation and Topics of interest

The second LREC workshop on "Challenges in the management of large
corpora" aims at gathering the leading researchers in the fields of
Language Resource creation and Corpus Linguistics, in order to provide
for an intensive exchange of expertise, results and ideas. In keeping
with this LREC's hot topic, "Big Data", contributions concerned with
national corpora, reference corpora and other very large corpora are
particularly welcome.
The half-day workshop will be wrapped up with a discussion about the
common challenges, ideas for possible solutions and potential
cooperation. We invite submissions dealing with:

* tools for all aspects of management of very large corpora,
* evaluation and investigation of the properties of large corpora,
* system and database architectures for very large semi-structured data
  sets,
* heavily annotated corpora,
* managing multiple and concurrent annotation layers,
* use of annotation standards for large data sets,
* issues of interoperability and tool-chaining,
* crowdsourcing for large data sets,
* quality control of annotations in large data sets,
* dealing with corpora physically distributed over different locations,
* efficient and scalable user interfaces,
* effective querying of large corpora with multiple annotation layers,
* "put the computation near the data" as strategy for dealing with
  IPR restrictions,
* open-source software and open-data corpora strategies,
* other issues that arise in the context of management of large
  datasets.

Summary of the Call

The workshop aims at gathering the leading researchers in the field of
Language Resource creation and Corpus Linguistics, in order to provide
for an intensive exchange of expertise, results and ideas concerning the
issues mentioned as "topics of interest" above, and primarily concerning
the creation, maintenance, extensibility and use of *large* and richly
annotated linguistic data sets, well above 1 billion (1*10^9) tokens
and nearing the petabyte range of volume.

Abstract submission

We invite extended abstracts for 15 to 20 minute presentations (4 pages
maximum). All abstracts have to be submitted via the START Conference
Manager at https://www.softconf.com/lrec2014/CMLC-2/ . Please note: When
submitting a paper from the START page, authors will be asked to provide
essential information about resources (in a broad sense, i.e. also
technologies, standards, evaluation kits, etc.) that have been used for
the work described in the paper or are a new result of your research.
Moreover, ELRA encourages all LREC authors to share the described LRs
(data, tools, services, etc.), to enable their reuse and the
replicability of experiments, including evaluation experiments.

Important dates

Deadline for submissions: Thursday, 20 February 2014
Notification of acceptance: Monday, 10 March 2014

Venue

The half-day workshop will take place at the conference venue, the Harpa
Conference Centre, in the afternoon session of Saturday, 31 May 2014.

Organizing Committee

The workshop is co-organized by the following institutions:

Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt

Institute for Corpus Linguistics and Text Technology, Vienna
Evelyn Breiteneder, Hanno Biber, Karlheinz Mörth

Programme committee:

* Lars Borin (University of Gothenburg)
* Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
* Václav Cvrček (Charles University Prague)
* Mark Davies (Brigham Young University)
* Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
* Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
* Andrew Hardie (University of Lancaster)
* Nancy Ide (Vassar College)
* Milos Jakubicek (Lexical Computing Ltd.)
* Adam Kilgarriff (Lexical Computing Ltd.)
* Krister Lindén (University of Helsinki)
* Jean-Luc Minel (Université Paris Ouest Nanterre La Défense)
* Christian Emil Ore (University of Oslo)
* Adam Przepiórkowski (Polish Academy of Sciences, University of Warsaw)
* Uwe Quasthoff (Leipzig University)
* Pavel Rychlý (Masaryk University Brno)
* Roland Schäfer (FU Berlin)
* Marko Tadić (University of Zagreb)
* Dan Tufiş (Romanian Academy, Bucharest)
* Tamás Váradi (Hungarian Academy of Sciences, Budapest)

Workshop homepage: http://corpora.ids-mannheim.de/cmlc.html
