Corpora: LREC WORKSHOP : Data Architectures and Software Support for Large Corpora
Nancy M. Ide
ide at cs.vassar.edu
Tue Feb 22 17:27:26 UTC 2000
*****************************************************************
SECOND CALL FOR PAPERS
LREC WORKSHOP
DATA ARCHITECTURES AND SOFTWARE SUPPORT FOR LARGE CORPORA
May 30, 2000
ATHENS, GREECE
http://www.cs.vassar.edu/~ide/anc/lrec.html
******************************************************************
SUBMISSION DEADLINE : MARCH 7, 2000
Several software systems for linguistic annotation, search,
and retrieval of large corpora have been developed within the
natural language processing community over the past several
years, including LT-XML (Edinburgh), GATE (Sheffield), IMS
Corpus Workbench (Stuttgart), Alembic Workbench (Mitre), MATE
(Edinburgh/Odense/Stuttgart), Silfide (Loria/CNRS), SARA
(BNC), and several others. Related to and in support of this
development, there have also been efforts to develop standards
for encoding and various kinds of linguistic annotation, as
well as data architectures (e.g., TIPSTER, TalkBank). Still
other developments, such as the introduction of XML
and the powerful XSL transformation language and work on
semi-structured data (e.g., the work of the Lore group at
Stanford), have also impacted the ways in which corpora and
other linguistic resources can be represented, stored, and
accessed.
Approaches to the fundamental design of formats, data, and
tools vary among current systems for the annotation and
exploitation of linguistic corpora. A primary reason for this
diversity is that most developers are concerned with only one
aspect of the creation/annotation/exploitation
process. However, in order to work effectively toward
commonality, the phases of the process must be considered as a
whole. This demands bringing together researchers and
developers from a variety of domains in text, speech, video,
etc., many of whom have previously had little or no contact.
This workshop is intended to bring these groups together to
look broadly at the technical issues that bear on the
development of software systems for the annotation and
exploitation of linguistic resources. The goal is to lay the
groundwork for the definition of a data and system
architecture to support corpus annotation and exploitation
that can be widely adopted within the community. Among the
issues to be addressed are:
o layered data architectures
o system architectures for distributed databases
o support for plurality of annotation schemes
o impact and use of XML/XSL
o support for multimedia, including speech and video
o tools for creation, annotation, query and access of corpora
o mechanisms for linkage of annotation and primary data
o applicability of semi-structured data models, search and query
systems, etc.
o evaluation/validation of systems and annotations
----------------------------------------------------------------------------
Submissions
Papers should be submitted in electronic form (preferably PostScript,
but plain ASCII, MS Word RTF, or HTML is acceptable) to
ide at cs.vassar.edu by March 7, 2000. Please use a subject line of
the form "LREC WORKSHOP SUBMISSION: <authors' last names>", for
example, "LREC WORKSHOP SUBMISSION: SMITH, JONES".
Organizers
Nancy Ide (contact)
Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520 USA
Tel : +1 914 437 5988
Fax : +1 914 437 7498
ide at vassar.edu
Henry S. Thompson
Human Communication Research Centre
2 Buccleuch Place
Edinburgh EH8 9LW
SCOTLAND
Tel : +44 (131) 650 4440
Fax : +44 (131) 650 4587
ht at cogsci.ed.ac.uk
Program Committee
Steven Bird, Linguistic Data Consortium
Patrice Bonhomme, LORIA/CNRS
Roy Byrd, IBM Corporation
Jean Carletta, HCRC Edinburgh
Ulrich Heid, IMS Stuttgart
Hamish Cunningham, Sheffield
David Day, Mitre Corporation
Robert Gaizauskas, Sheffield
Ralph Grishman, New York University
Nancy Ide, Vassar College (Chair)
Masato Ishizaki, JAIST
Dan Jurafsky, University of Colorado at Boulder
Tony McEnery, Lancaster
David McKelvie, HCRC Edinburgh
Laurent Romary, LORIA/CNRS
Gary Simons, Summer Institute of Linguistics
Henry Thompson, HCRC Edinburgh
Yorick Wilks, Sheffield
Peter Wittenburg, Max Planck Institute
Remi Zajac, New Mexico State University