23.343, Calls: Text/Corpus Linguistics/Turkey

Thu Jan 19 16:20:52 UTC 2012

LINGUIST List: Vol-23-343. Thu Jan 19 2012. ISSN: 1069 - 4875.

Subject: 23.343, Calls: Text/Corpus Linguistics/Turkey

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
         Monica Macaulay, U of Wisconsin-Madison
         Rajiv Rao, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org

The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.

Editor for this issue: Alison Zaharee <alison at linguistlist.org>


Date: 19-Jan-2012
From: Piotr Banski [banski at ids-mannheim.de]
Subject: Challenges in the Management of Large Corpora

-------------------------Message 1 ---------------------------------- 
Date: Thu, 19 Jan 2012 11:20:09

Full Title: Challenges in the Management of Large Corpora 
Short Title: CMLC 

Date: 22-May-2012 - 22-May-2012
Location: Istanbul, Turkey 
Contact Person: Piotr Banski
Meeting Email: banski at ids-mannheim.de
Web Site: http://corpora.ids-mannheim.de/cmlc.html 

Linguistic Field(s): Text/Corpus Linguistics 

Call Deadline: 15-Feb-2012 

Meeting Description:

We live in an age in which the well-known maxim that 'the only thing better than data is more data' no longer sets unattainable goals. Creating extremely large corpora is no longer a challenge, given proven methods such as the Web-as-Corpus approach or the use of Google's n-gram collection. Instead, the challenge has shifted towards dealing with large amounts of primary data and much larger amounts of annotation data. On the one hand, this challenge concerns finding new (corpus-)linguistic methodologies that can make use of such /extremely large corpora/, e.g. in order to investigate rare phenomena involving multiple lexical items or to find and represent fine-grained sub-regularities; on the other hand, some fundamental technical methods and strategies are being called into question. These include, e.g., successful curation of the data, management of collections that span multiple volumes or are distributed across several centres, methods for cleaning the data of non-linguistic intrusions and duplicates, as well as automatic annotation methods and innovative corpus architectures that maximise the usefulness of the data or allow it to be searched and analysed efficiently. Among the new tasks are also collaborative manual annotation and methods for managing it, as well as new challenges for the statistical analysis of such data and metadata.

The half-day LREC-2012 workshop on 'Challenges in the Management of Large Corpora' aims to bring together leading researchers in the fields of Language Resource creation and Corpus Linguistics for an intensive exchange of expertise, results and ideas.


The workshop will take place at the conference venue, the Lütfi Kirdar Istanbul Exhibition and Congress Centre. Further details will be available in due course from the conference homepage.

Call for Papers:

We invite submissions dealing with:

- Building tools for all aspects of management of very large corpora
- Dealing with large data sets (file system architecture, database architecture)
- Dealing with heavily annotated corpora
- Managing multiple and concurrent annotation layers
- Use of annotation standards for large data sets
- Issues of interoperability and tool-chaining
- Crowdsourcing for large data sets
- Quality control of annotations in large data sets
- Analytic tools used in research infrastructure initiatives, such as the Common Language Resource and Technology Infrastructure (CLARIN)
- Dealing with corpora physically distributed over different locations
- Managing metadata for extremely large corpus collections
- Efficient user interfaces
- Effective querying of large corpora with multiple annotation layers
- 'Bringing the code to the data' as the strategy for dealing with IPR restrictions
- Open-source software and open-data corpora strategies
- Other issues that arise in the context of management of large datasets

Current information is available at:


Abstract Submission:

We invite extended abstracts (1500 to 2000 words) for 20+10 minute presentations, as well as posters and demos. All abstracts have to be submitted via the START Conference Manager, available at:


Please note: when submitting a contribution via START, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the contribution or that are a new result of their research. For further information on this new initiative, please refer to:


Important Dates:

Deadline for submission of extended abstracts: 15 February 2012
Notification of acceptance: 29 February 2012
Submission of full, camera-ready papers: 23 March 2012
Workshop: 22 May 2012, afternoon session

Organizing Committee:

The workshop is co-organized by the following three institutions:

Institut für Deutsche Sprache, Mannheim - Piotr Bański, Marc Kupietz, Andreas Witt
Institute for Language Information and Technology, Eastern Michigan University - Helen Aristar-Dry, Anthony Aristar, Damir Ćavar
ICAR Laboratory, Lyon University - Serge Heiden

Programme Committee:

Núria Bel (Universitat Pompeu Fabra)
Mark Davies (Brigham Young University)
Stefanie Dipper (Ruhr-Universität Bochum)
Tomaž Erjavec (Jožef Stefan Institute)
Stefan Evert (Technische Universität Darmstadt)
Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
Andrew Hardie (University of Lancaster)
Nancy Ide (Vassar College)
Sandra Kübler (Indiana University)
Martin Mueller (Northwestern University)
Mark Olsen (University of Chicago)
Adam Przepiórkowski (Polish Academy of Sciences, University of Warsaw)
Reinhard Rapp (Johannes Gutenberg-Universität Mainz, University of Leeds)
Laurent Romary (INRIA, Humboldt-Universität zu Berlin)
Serge Sharoff (University of Leeds)
Pavel Straňák (Charles University in Prague)
Amir Zeldes (Humboldt-Universität zu Berlin)
