15.3194, FYI: ERF 2004 Award; Web as Corpus at CL 2005

Sat Nov 13 22:44:18 UTC 2004

LINGUIST List: Vol-15-3194. Sat Nov 13 2004. ISSN: 1068 - 4875.

Subject: 15.3194, FYI: ERF 2004 Award; Web as Corpus at CL 2005

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org)
        Sheila Collberg, U of Arizona
        Terry Langendoen, U of Arizona

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Ann Sawyer <sawyer at linguistlist.org>
================================================================

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================

1)
Date: 13-Nov-2004
From: Julian Bamford < bamford at shonan.bunkyo.ac.jp >
Subject: ERF 2004 Language Learner Literature Award

2)
Date: 12-Nov-2004
From: Marco Baroni < baroni at sslmit.unibo.it >
Subject: The Web as Corpus at CL 2005

-------------------------Message 1 ----------------------------------
Date: Sat, 13 Nov 2004 17:35:26
From: Julian Bamford < bamford at shonan.bunkyo.ac.jp >
Subject: ERF 2004 Language Learner Literature Award

The Extensive Reading Foundation 2004 Language Learner Literature Award

The Extensive Reading Foundation (ERF) is a new, unaffiliated, not-for-profit
group to support and promote extensive reading in language
education.  The ERF has established an award for language learner
literature (graded readers) in English.

The winning books for 2004 will be announced on November 20 at
the Japan Association for Language Teaching International
Conference (JALT 2004) in Nara.

For more information about the ERF and the Language Learner
Literature Award, visit erfoundation.org.

Note: We will notify you of the winning books shortly after November 21.
However, if you publish after November 21 but need copy immediately,
we can send you advance notice of the winning books.
Write to bamford at shonan.bunkyo.ac.jp.

Richard Day, Extensive Reading Foundation Chair
rday at hawaii.edu

Linguistic Field(s): Ling & Literature

-------------------------Message 2 ----------------------------------
Date: Sat, 13 Nov 2004 17:35:28
From: Marco Baroni < baroni at sslmit.unibo.it >
Subject: The Web as Corpus at CL 2005

Planned colloquium on 'The Web as a Corpus' at Corpus Linguistics 2005

Motivation

The World Wide Web is a mine of language data of unprecedented richness
and ease of access (Kilgarriff and Grefenstette 2003). A growing body of
studies has shown that simple algorithms using Web-based evidence are
successful at many linguistic tasks, often outperforming sophisticated methods
based on smaller but more controlled data sources (e.g., Turney 2001,
Keller and Lapata 2003), despite the many peculiarities of data that might
be used in this way.

Current Internet-based linguistic studies differ in terms of strategies used to
access Web data. For example, some researchers collect frequency data directly
from commercial search engines (e.g., Turney 2001). Others use a search engine
to find relevant pages, and then retrieve the pages to build a corpus (e.g.,
Ghani et al. 2001, Baroni and Bernardini 2004). Others yet build a corpus by
spidering the web and manage the data with an ad-hoc search engine (e.g., Terra
and Clarke 2003).

Different approaches have also been proposed to the task of sharing web-derived
data. For example, some researchers make web-mining tools available (e.g.,
Fletcher 2000, Baroni and Bernardini 2004) while others provide URL lists that
allow users to construct web-corpora (e.g., Ghani et al. 2001, Resnik and Smith
2003), and others yet have proposed prototypes of Internet search engines for
the linguists' community (Kehoe and Renouf 2002, Fletcher 2002, Kilgarriff 2003,
Resnik and Elkiss 2003).

Many fundamental issues about the viability and exploitation of the web as a
linguistic corpus must still be explored, or are just starting to be tackled.
Some of these issues are of theoretical interest, such as word frequency
distributions and topical biases in Internet documents, while others pertain to
equally important implementational and practical aspects, such as efficient
handling of massive data sets and the legal standing of indexing for linguistic
purposes.

Thus, we believe that research on the web as a corpus is currently at a very
exciting stage: increasing evidence points to the enormous potential of the
Internet as a source of linguistic data, but we are still far removed from
anything like a working, fully-fledged linguist's search engine.

CALL FOR EXPRESSIONS OF INTEREST

We are planning a colloquium to be held at Corpus Linguistics 2005 (Birmingham,
UK, 14-17 July 2005) in which scholars using (or planning to use) the web as a
corpus can meet to share experiences and plans.

Anybody interested in actively participating in the event should fill out the
online expression-of-interest form at the address specified below, as soon as
possible, and in any case by DECEMBER 14 2004, to give us time to prepare the
official colloquium proposal to be submitted for review (deadline for submission
of colloquium proposals: January 14 2005).

We will get in touch with those who submitted expressions of interest as soon as
possible, and in any case by early January 2005.

WEB-AS-CORPUS COLLOQUIUM ORGANIZERS

Adam Kilgarriff (Lexicography MasterClass)
Marco Baroni (University of Bologna)

WEB-AS-CORPUS EXPRESSION OF INTEREST FORM

http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

CORPUS LINGUISTICS 2005 WEBSITE

http://www.corpus.bham.ac.uk/conference/

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics
