[Corpora-List] First CfP: EACL 2014 Workshop on Web as Corpus (WAC-9)

Felix Bildhauer felix.bildhauer at fu-berlin.de
Mon Nov 11 23:47:30 UTC 2013


The 9th Web as Corpus Workshop (WAC-9)

http://www.sigwac.org.uk/wiki/WAC9

Endorsed by the Special Interest Group of the ACL on Web as Corpus


The World Wide Web has become increasingly popular as a source of 
linguistic data, not only within the NLP communities, but also with 
theoretical linguists facing problems of data sparseness or data 
diversity. Accordingly, web corpora continue to gain importance, given 
their size and diversity in terms of genres/text types. However, the 
field is still new, and a number of issues in web corpus construction 
still need much research (fundamental and applied), ranging from 
questions of corpus design (e.g., corpus composition assessment, 
sampling strategies and their relation to crawling algorithms, handling 
of duplicated material) to more technical aspects (e.g., efficient 
implementation of individual post-processing steps in document cleansing 
and linguistic annotation, or large-scale parallelization to achieve 
web-scale corpus construction). Similarly, the systematic evaluation of 
web corpora, for example in the form of task-based comparisons to 
traditional corpora, has only lately shifted into focus.

For almost a decade, the ACL SIGWAC, and especially the highly 
successful Web as Corpus (WaC) workshops, have served as a platform for 
researchers interested in building and working with web-derived corpora. 
Past workshops have been co-located with major conferences on 
computational linguistics and/or corpus linguistics (such as EACL, 
LREC, WWW, Corpus Linguistics). As part of the workshop, we will have a 
panel discussion dedicated to the planning of a shared task for WAC-10 
(2015), including the nomination of organizers of the shared task. The 
tracks of the shared task will focus on the quality of web corpus 
creation tools, tools for linguistic annotation (at least lemmatization, 
possibly also POS tagging, etc.), and the quality of web corpora themselves.


CALL FOR PAPERS

As in previous years, the 9th Web as Corpus workshop (WAC-9) invites 
original contributions pertaining to all aspects of web corpora, 
including data collection, cleaning, duplicate removal, document 
filtering, linguistic post-processing, and use of web corpora in 
language technology and linguistics.

However, a major challenge in the construction of web corpora is the 
question of the quality and the evaluation of both the software used in 
the construction of web corpora and the corpora themselves. 
Therefore, WAC-9 seeks to put special emphasis on these topics, and it 
particularly encourages submissions addressing the following points:

* noise in web corpora: normalization and implications for linguistic 
annotation (lemmatization, POS tagging, parsing, etc.)
* task-based ("extrinsic") evaluation of web corpora, especially in 
comparison to traditional corpus resources and n-gram databases (Web 1T 
5-Grams, Google Books)
* missing metadata in web corpora: enriching web corpora with metadata by 
automatic classification with high accuracy
* sampling strategies/crawling algorithms and their effect on corpus 
composition/corpus quality
* non-destructive cleaning and normalization of web data (Currently 
available web corpora have usually undergone radical cleaning procedures 
in order to produce "high-quality" data. At least for some uses of the 
data, aggressive and sometimes arbitrary removal of material in the form 
of whole documents or parts thereof can be problematic. The same is true 
for aggressive normalization of the data. To meet such problems, ways of 
cleaning and normalizing the data transparently, i.e., preserving the 
non-normalized forms, should be discussed.)


SUBMISSION DETAILS

Abstracts should be
* anonymous
* no longer than two pages (including figures and references)
* in PDF format
* formatted according to the EACL stylesheet (templates for LaTeX and MS 
Word are available from http://www.eacl2014.org/files/eacl-2014-styles.zip)
* submitted via the START online submission system at 
https://www.softconf.com/eacl2014/WaC9/
* submitted no later than 23 January 2014


ORGANIZING COMMITTEE

     Felix Bildhauer, Freie Universität Berlin
     Roland Schäfer, Freie Universität Berlin


PROGRAM COMMITTEE

Organizing committee, plus

     Adrien Barbaresi, École Normale Supérieure de Lyon
     Silvia Bernardini, Università di Bologna
     Chris Biemann, Technische Universität Darmstadt
     Jesse Egbert, Northern Arizona University
     Stefan Evert, Friedrich-Alexander Universität Erlangen-Nürnberg
     Adriano Ferraresi, Università di Bologna
     William Fletcher, United States Naval Academy
     Dirk Goldhahn, Universität Leipzig
     Adam Kilgarriff, Lexical Computing Ltd.
     Anke Lüdeling, Humboldt-Universität zu Berlin
     Alexander Mehler, Goethe-Universität Frankfurt am Main
     Uwe Quasthoff, Universität Leipzig
     Paul Rayson, Lancaster University
     Serge Sharoff, University of Leeds
     Sabine Schulte im Walde, Universität Stuttgart
     Egon Stemle, European Academy of Bolzano
     Yannick Versley, Universität Heidelberg
     Torsten Zesch, Universität Darmstadt
     Stephen Wattam, Lancaster University


IMPORTANT DATES

     11 November 2013: First Call for Workshop Papers
     12 December 2013: Second Call for Workshop Papers
     4 January 2014: Final Call for Workshop Papers
     23 January 2014: Workshop Paper Due Date
     20 February 2014: Notification of Acceptance
     3 March 2014: Camera-ready papers due
     26-27 April 2014: EACL Workshop Dates
