27.752, Calls: German, Computational Ling/Germany

Tue Feb 9 16:01:48 UTC 2016

LINGUIST List: Vol-27-752. Tue Feb 09 2016. ISSN: 1069 - 4875.

Subject: 27.752, Calls: German, Computational Ling/Germany

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================

Date: Tue, 09 Feb 2016 11:01:22
From: Kay-Michael Würzner [wuerzner at bbaw.de]
Subject: GSCL Shared Task: Automatic Linguistic Annotation of Computer-Mediated Communication

Full Title: GSCL Shared Task: Automatic Linguistic Annotation of Computer-Mediated Communication 
Short Title: EmpiriST 2015 

Date: 15-Feb-2016 - 26-Feb-2016
Location: None, Germany 
Contact Person: Kay-Michael Würzner
Meeting Email: empirist at collocations.de
Web Site: https://sites.google.com/site/empirist2015/ 

Linguistic Field(s): Computational Linguistics 

Subject Language(s): German (deu)

Call Deadline: 14-Feb-2016 

Meeting Description:

The goal of this shared task is to encourage the developers of NLP
applications to adapt their tools and resources for the processing of written
German discourse in genres of computer-mediated communication (CMC). Examples
of CMC genres are chats, forums, wiki talk pages, tweets, blog comments,
social networks, SMS and WhatsApp dialogues.

Robust processing of CMC discourse is needed in several research fields and
application contexts in the Digital Humanities, e.g.:

- in the context of building, processing and analyzing corpora of
computer-mediated communication / social media (chat corpora, news corpora,
WhatsApp corpora, ...)
- in the context of collecting, processing and analyzing large,
genre-heterogeneous web corpora as resources in the field of Language
Technology / Data Mining
- in the context of dealing with CMC data in corpus-based analyses of
contemporary written language, language variation and language change
- in all research fields beyond linguistics which address social, cultural and
educational aspects of social media and CMC technologies using language data
from CMC genres

The shared task consists of two subtasks:

- Tokenization of CMC discourse
- Part-of-speech tagging of CMC discourse

The two subtasks will have to be handled for two different data sets:

- CMC data set: a selection of data from different CMC genres (social chat,
professional chat, Wikipedia talk pages, blog comments, tweets, WhatsApp
dialogues).
- Web corpora data set: a selection of data which represents written discourse
from heterogeneous WWW genres. It consists of crawled websites including small
portions of CMC discourse (e.g. webpages, blogs, news sites, blog commentary
etc.).

We will provide training data sets which have been manually tokenized and
tagged on the basis of detailed annotation guidelines.

Before the release of the full task we will publish a small set of trial data
which may be used by developers. Annotation guidelines which have been used
for annotating the trial and training data are available, too.

The shared task (ST) has been prepared by members of the DFG scientific
network Empirikom (hence the name "EmpiriST"). Its preparation has been funded
by the German Society for Language Technology and Computational Linguistics
(GSCL).

The shared task is endorsed by the ACL Special Interest Group on the Web as
Corpus and by the GSCL Special Interest Group on Social Media /
Computer-Mediated Communication.

2nd Call for Participation: 

https://sites.google.com/site/empirist2015/

UPDATED SCHEDULE

20.12.2015 - Release of the training data
*14.02.2016* - Extended deadline for team registration
15.02.2016 - Release of the evaluation data for the tokenization subtask
19.02.2016 - Submission deadline for the tokenization subtask
22.02.2016 - Release of the evaluation data for the POS-tagging subtask
26.02.2016 - Submission deadline for the POS-tagging subtask
*08.05.2016* - Submission of system description papers (4 pages + references)
12.08.2016 - Presentation of systems and task results at WAC-X workshop (ACL
2016, Berlin)

Note that a postponed schedule for the evaluation period was temporarily shown
on the task Web site by mistake. The correct schedule is as shown above, with
evaluation taking place from Feb 15 to Feb 26.

REGISTRATION

In order to register as a competitor for EmpiriST 2015, please send a message
to empirist at collocations.de containing the following information:

- Team name (will be used to identify submissions)
- Name(s) of team member(s)
- Affiliation(s)
- Subtasks you plan to participate in (CMC Tok, CMC PoS, Web Tok, Web PoS)
- Contact person and e-mail address

Task participants should also join our Google group at
https://groups.google.com/d/forum/empirist2015

DETAILS

The EmpiriST 2015 shared task aims to encourage the developers of NLP
applications to adapt their tools and resources for the processing of written
German discourse in genres of computer-mediated communication (CMC) – such as
chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS
and WhatsApp dialogues – as well as monological web pages – such as personal
or professional blogs, Wikipedia articles, academic sites, etc.

The shared task is divided into two subtasks (A: tokenization, B: POS tagging)
and two different data sets (CMC subset, web corpora subset). While our main
goal is to foster the development of robust tools that work well on a wide
range of CMC & web genres, teams are allowed to focus on one subtask or one
subset only. Full manually annotated training data are available now on the
EmpiriST homepage, comprising approx. 5000 tokens for each subset.

Results and system descriptions will be presented in the WAC-X workshop
co-located with ACL 2016 in Berlin, Germany (11 or 12 August 2016).

For more information, including detailed annotation guidelines and
instructions for participation, see the EmpiriST homepage at

https://sites.google.com/site/empirist2015/

and join our Google group for updates, questions and discussion:

https://groups.google.com/d/forum/empirist2015

While EmpiriST is focussed on the annotation of German-language data,
familiarity with German is not essential for participating in the task. There
are sufficient amounts of training data for general machine learning, domain
adaptation and optimization approaches. We also provide an English summary of
the POS tagset and annotation guidelines.

TASK FORCE

CMC data set:

- Michael Beißwenger (Technische Universität Dortmund)
- Kay-Michael Würzner (Berlin-Brandenburgische Akademie der Wissenschaften)

Web corpora data set:

- Sabine Bartsch (Technische Universität Darmstadt)
- Stefan Evert (Universität Erlangen-Nürnberg)
