16.1843, Confs: Text/Corpus Ling/Birmingham, UK

Sat Jun 11 15:42:16 UTC 2005

LINGUIST List: Vol-16-1843. Sat Jun 11 2005. ISSN: 1068 - 4875.

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Dooley, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Amy Wronkowicz <amy at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 09-Jun-2005
From: Sebastian Hoffmann < sebhoff at es.unizh.ch >
Subject: Web as Corpus Workshop/Tutorial (CL2005) 

-------------------------Message 1 ---------------------------------- 
Date: Sat, 11 Jun 2005 11:31:54
From: Sebastian Hoffmann < sebhoff at es.unizh.ch >
Subject: Web as Corpus Workshop/Tutorial (CL2005) 

Web as Corpus Workshop/Tutorial (CL2005) 

Date: 14-Jul-2005
Location: Birmingham, United Kingdom 
Contact: Sebastian Hoffmann 
Contact Email: sebhoff at es.unizh.ch 
Meeting URL: http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html 

Linguistic Field(s): Text/Corpus Linguistics 

Meeting Description: 

WEB AS CORPUS
Pre-conference workshop/tutorial
Corpus Linguistics 2005
14th July 2005
Birmingham University, UK

http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

Co-chairs:
Marco Baroni, Sebastian Hoffmann, Adam Kilgarriff

Motivation:

The World Wide Web is a mine of language data of unprecedented richness and ease
of access (Kilgarriff and Grefenstette, 2003). A growing body of studies has
shown that simple algorithms using Web-based evidence are successful at many
linguistic tasks, often outperforming sophisticated methods based on smaller but
more controlled data sources (e.g., Turney 2001).

However, many fundamental issues about the viability and exploitation of the web
as a linguistic corpus must still be explored, or are just starting to be
tackled. These issues range from word frequency distributions on the web to
efficient handling of massive data sets, to the legal standing of web indexing.

Thus, we believe that research on the web as corpus is currently at a very
exciting stage: increasing evidence points to the enormous potential of the
Internet as a source of linguistic data, but we are still far removed from
anything like a working, fully-fledged tool for linguists and language
technologists to use the web as a corpus.

Contents:

This full-day workshop and tutorial will provide an introduction to the issues
involved in using the web as a corpus.  The emphasis will be practical and
participatory, with presentations of programs addressing particular issues, and
opportunities for all participants to describe their experiences of working with
the web as a source of linguistic data.  We shall also aim to establish the main
challenges that lie ahead for this young community, and how it should work
collectively to address them.

* General overview of web-as-corpus work
* Building large/general and small/special-purpose web corpora
* Web crawling for linguistic purposes
* (Near-)duplicate detection, boilerplate removal, language identification
* Linguistic annotation
* Working with non-Latin-1 languages
* Indexing and retrieval from large document collections
* Prospective interfaces

Provisional program:

9:30-10:00 Adam Kilgarriff (Lexicography MasterClass) - Welcome, goals of the
workshop, overview of program
10:00-10:45 Tom Emerson (Basis Technology) - Large crawls of the web for
linguistic purposes
10:45-11:15 coffee break
11:15-12:00 Marco Baroni (University of Bologna) and Serge Sharoff (University
of Leeds) - Creating specialized and general corpora using automated search
engine queries
12:00-13:00 Small groups arranged around the participants' research purposes

13:00-14:30 lunch break

14:30-15:15 Sebastian Hoffmann (University of Zurich) - Processing web-derived
text (or: Working with very messy data)
15:15-16:00 Stefan Evert (University of Osnabrück) and Adam Kilgarriff
(Lexicography MasterClass) - Indexing and interfaces
16:00-16:30 coffee break
16:30-17:00 Alexander Mehler and Rüdiger Gleim (University of Bielefeld) -
Representing genre-specific websites
17:00-17:30 Small groups on "what are critical next steps for Web-as-Corpus
activity?"
17:30-18:10 Plenary: where next?

Registration:

Registration and accommodation are managed by the main conference organizers.
Please visit:

http://www.corpus.bham.ac.uk/conference
