[Corpora-List] WEB AS CORPUS: Workshop/Tutorial, 14th July 05, Birmingham UK

Thu Jun 9 08:51:33 UTC 2005

                        ********************************
                                 WEB AS CORPUS
                        Pre-conference workshop/tutorial
                             Corpus Linguistics 2005
                                 14th July 2005
                             Birmingham University, UK
                        *********************************

              http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

                                    Co-chairs:
                  Marco Baroni, Sebastian Hoffmann, Adam Kilgarriff

Motivation:

The World Wide Web is a mine of language data of unprecedented richness
and ease of access (Kilgarriff and Grefenstette, 2003). A growing body of
studies has shown that simple algorithms using Web-based evidence are
successful at many linguistic tasks, often outperforming sophisticated
methods based on smaller but more controlled data sources (e.g., Turney
2001).

However, many fundamental issues about the viability and exploitation of
the web as a linguistic corpus must still be explored, or are just
starting to be tackled. These issues range from word frequency
distributions on the web to efficient handling of massive data sets, to
the legal standing of web indexing.

Thus, we believe that research on the web as corpus is currently at a
very exciting stage: increasing evidence points to the enormous potential
of the Internet as a source of linguistic data, but we are still far
removed from anything like a working, fully-fledged tool for linguists and
language technologists to use the web as a corpus.

Contents:

This full-day workshop and tutorial will provide an introduction to the
issues involved in using the web as a corpus.  The emphasis will be
practical and participatory, with presentations of programs addressing
particular issues, and opportunities for all participants to describe their
experiences of working with the web as a source of linguistic data.  We
shall also aim to establish what the main challenges ahead are for this
young community, and how it should work collectively to address them.

* General overview of web-as-corpus work
* Building large/general and small/special-purpose web corpora
* Web crawling for linguistic purposes
* (Near-)duplicate detection, boilerplate removal, language identification
* Linguistic annotation
* Working with non-Latin-1 languages
* Indexing and retrieval from large document collections
* Prospective interfaces

Provisional program:

9:30-10:00 Adam Kilgarriff (Lexicography MasterClass) - Welcome, goals of
  the workshop, overview of program
10:00-10:45 Tom Emerson (Basis Technology) - Large crawls of the web for
  linguistic purposes
10:45-11:15 coffee break
11:15-12:00 Marco Baroni (University of Bologna) and Serge Sharoff
  (University of Leeds) - Creating specialized and general corpora using
  automated search engine queries
12:00-13:00 Small groups arranged around the participants' research
  purposes

13:00-14:30 lunch break

14:30-15:15 Sebastian Hoffmann (University of Zurich) - Processing
  web-derived text (or: Working with very messy data)
15:15-16:00 Stefan Evert (University of Osnabrück) and Adam Kilgarriff
  (Lexicography MasterClass) - Indexing and interfaces
16:00-16:30 coffee break
16:30-17:00 Alexander Mehler and Rüdiger Gleim (University of Bielefeld) -
  Representing genre-specific websites
17:00-17:30 Small groups on "what are critical next steps for
  Web-as-Corpus activity?"
17:30-18:10 Plenary: where next?

Registration:

Registration and accommodation are managed by the main conference
organizers. Please visit:

http://www.corpus.bham.ac.uk/conference