[Corpora-List] 2nd CfP: Corpus Analysis with Noise in the Signal (CANS 2013) workshop

Alistair Baron a.baron at comp.lancs.ac.uk
Mon Jan 21 16:34:32 UTC 2013


*Call for Papers*

*Workshop: Corpus Analysis with Noise in the Signal (CANS 2013)*
*at Corpus Linguistics 2013 conference (CL2013), Lancaster University, UK.*
*22nd July 2013*

http://ucrel.lancs.ac.uk/cans2013/

*Submission deadline: 22nd February 2013*

Whilst many widely-used corpora include mainly standard written text on
which a range of automatic corpus analysis and Natural Language Processing
(NLP) techniques can be accurately performed, an increasing number of
corpora contain substantial amounts of noisy textual data and irregular
language. Such corpora range from relatively small specialised historical
corpora (e.g. Early Modern English Medical Texts (EMEMT)) and second
language learner corpora (e.g. French Learner Language Oral Corpora
(FLLOC)) to very large datasets such as the transcribed Early English Books
Online collection (EEBO-TCP), large collections of OCRed books (e.g. from
Google Books) and the very large corpora being crawled from the web (e.g.
from Twitter, and Web as Corpus). These non-standard language varieties can
cause significant issues for corpus analysis tools, which in the majority
of cases are set up and trained to deal with clean standard texts.

Our response to some of these issues has been the development of a Variant
Detector tool (VARD2 <http://ucrel.lancs.ac.uk/vard>). Originally developed
to normalize spelling variants within historical English datasets, VARD2
has since been adapted for use with SMS, Twitter, child language, learner
corpora, other languages, etc. The purpose of this workshop is to provide a
forum in which we can discuss, and compare, our approach with other
researchers' approaches to noise. This may include work in which researchers
have used and adapted VARD2, or have utilised new tools and methods.

We invite submissions to present research highlighting the impact of noisy
textual data on corpus-based research and/or providing methods to negate
the effect of such noise. We are interested in research concerning any
corpora with substantial textual noise and are particularly keen to have a
range of languages and noise sources represented at the workshop.

Noise sources may include but are not limited to:

   - Historical spelling variation
   - Computer-mediated language varieties (e.g. chatroom, SMS, social
   networks, blogs, Twitter, etc.)
   - First and second language learner corpora
   - Inaccurately digitised texts, e.g. badly OCRed or badly transcribed
   corpora
   - Idiosyncratic language usage/idiolect features


Topics of interest include but are not limited to:

   - Evaluations of established corpus analysis methodology when processing
   noisy corpora.
   - Methods for pre-processing noise in corpora, such as spelling
   normalisation and error correction.
   - Development of noise-aware corpus analysis methods which are robust
   enough to deal with noisy corpora and process them with accuracy, e.g. new
   automatic part-of-speech taggers.
   - Analyses of the characteristics and trends of spelling variation and
   language irregularities.
   - Studies which highlight the importance of maintaining original
   spellings and language irregularities and how these can assist in some
   aspects of corpus analysis.


Two types of submission are sought: full paper presentations and shorter
work-in-progress reports. For full papers we require an extended abstract
of 1,000-2,000 words. For work-in-progress reports we require shorter
abstracts of 500-1,000 words. The deadline for submitting abstracts is
*22nd February 2013*. Abstracts will then be reviewed by the organising
committee, and you will receive a response by 11th March 2013. The
organising committee consists of:

   - Alistair Baron (Lancaster University)
   - Paul Rayson (Lancaster University)
   - Dawn Archer (University of Central Lancashire)


Papers should be submitted to cans2013 at comp.lancs.ac.uk, and should use the
same guidelines and template as those for the main Corpus Linguistics 2013
conference, with the exception of text length restrictions. Further
instructions for submission are provided on the workshop website
<http://ucrel.lancs.ac.uk/cans2013/#submission>.

Accepted full papers will be allocated 20 minutes + 5 minutes for
questions; accepted work-in-progress reports will be allocated 10 minutes +
5 minutes for questions. The remaining time will include an open discussion
of the papers presented and general topics such as:

   - What are the key challenges of dealing with noisy textual data going
   forward?
   - When should we leave "noise" where it is? And for what reason(s)?
   - What are the dangers of ignoring the noise?


We expect to select papers from the workshop for a peer-reviewed journal
special issue.

*In line with the policy of the conference organisers, you are welcome to
submit abstracts both for this workshop and for the main Corpus Linguistics
2013 conference. However, if you give two papers they should be different,
without substantial overlap.*

-- 
Dr. Alistair Baron
Faculty Research Fellow
Security Lancaster
School of Computing and Communications
Infolab21
Lancaster University
Lancaster LA1 4WA
UK

O: B61 Infolab21
T: +44 (0)1524 510519 (temp.)
E: a.baron at lancs.ac.uk <a.baron at comp.lancs.ac.uk>
W: http://www.comp.lancs.ac.uk/~barona