28.876, Calls: Computational Linguistics, Text/Corpus Linguistics/UK

Wed Feb 15 15:47:58 UTC 2017

LINGUIST List: Vol-28-876. Wed Feb 15 2017. ISSN: 1069 - 4875.

Subject: 28.876, Calls: Computational Linguistics, Text/Corpus Linguistics/UK

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Wed, 15 Feb 2017 10:47:47
From: Piotr Banski [banski at ids-mannheim.de]
Subject: Challenges in the Management of Large Corpora + Big Data and Natural Language Processing

Full Title: Challenges in the Management of Large Corpora + Big Data and Natural Language Processing 
Short Title: CMLC 5 + BigNLP 2017 

Date: 24-Jul-2017 - 24-Jul-2017
Location: Birmingham, United Kingdom 
Contact Person: Piotr Banski
Meeting Email: cmlc+bignlp at ids-mannheim.de
Web Site: http://corpora.ids-mannheim.de/cmlc-2017.html 

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics 

Call Deadline: 12-Mar-2017 

Meeting Description:

The CMLC+BigNLP workshop is a joint initiative of two teams who have decided
to join forces for the purpose of organizing an event co-located with Corpus
Linguistics 2017 in Birmingham. The upcoming meeting continues the successful
series of “Challenges in the management of large corpora” events (previously
hosted at LREC conferences and CL2015) and is at the same time the second
event in the BigNLP series, inaugurated last year at the IEEE Big Data
2016 conference. This year, we wish to explore together common areas of
interest across a range of issues in language resource management, corpus
linguistics, natural language processing and data science.

An increasing amount of text is available in digital format: more historical
archives are being digitised, more publishing houses are opening their textual
assets for text mining, and many billions of words can be quickly sourced from
the web and online social media. The resulting large textual datasets are used
across a number of disciplines to answer a wide range of research questions.
In order for these datasets to be maximally useful, careful consideration
must be given to their design, collection, cleaning, encoding,
annotation, storage, retrieval and curation.

A number of key themes and questions of interest to the contributing
research communities emerge: (a) is having more data always better? (b) is the full
range of text types available online and what quality issues should we be
aware of? (c) what infrastructures and frameworks are being developed for the
efficient storage, annotation, analysis and retrieval of large datasets? (d)
what affordances do visualisation techniques offer for the exploratory
analysis of corpora? (e) what are the key legal and ethical issues
related to the use of large corpora?

An open-access (CC BY-NC-ND) electronic volume of proceedings is planned.

This year’s event focuses on the union of the standard topics of CMLC and
BigNLP:

Technical issues:

- Storage and retrieval solutions for big textual data corpora: primary data,
metadata, and annotation data
- Scalable and efficient NLP tooling for annotating and analysing large
datasets: distributed and GPGPU computing; using big data analysis frameworks
(Hadoop, Spark, etc.) for language processing
- Dealing with streaming data (e.g. social media) and rapidly changing corpora

Licensing, legal and privacy issues:

- Licensing models of open and closed data
- Coping with intellectual property restrictions

Linguistic content issues:

- Dealing with the variety of language: multilinguality, historical texts,
user-generated content, etc.
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations

Exploitation issues:

- Query languages
- Innovative approaches for aggregation and visualisation of text analytics

Call for Papers:

We invite anonymized extended abstracts for oral presentations on the topics
listed above (PDF, 1000-1500 words excluding references, font preferably 11
pt, line spacing 1.5).

CMLC has always reserved a track for national corpus project reports, and to
this end, we invite poster proposals of 500-750 words. National project
reports need not be anonymized. The number of poster slots is limited. If
there is spare capacity in the poster session, we reserve the right to change
the presentation format of accepted papers from oral presentation to poster.
Such a change will not affect how the paper is presented in the proceedings.

Submissions are accepted exclusively through the EasyAbs submission system, at
http://linguistlist.org/easyabs/cmlc+bignlp .

Important dates

- Submission deadline: 12 March, midnight UTC
- Notification of acceptance: 18 April
- Camera-ready papers due: 18 June
- Workshop date: 24 July 2017, afternoon session
