Corpora: LREC WORKSHOP : Data Architectures and Software Support for Large Corpora

Nancy M. Ide ide at
Tue Feb 22 17:27:26 UTC 2000

                           SECOND CALL FOR PAPERS

                               LREC WORKSHOP


                               May 30, 2000
                              ATHENS, GREECE



                     SUBMISSION DEADLINE : MARCH 7, 2000

        Several software systems for linguistic annotation, search,
        and retrieval of large corpora have been developed within the
        natural language processing community over the past several
        years, including LT-XML (Edinburgh), GATE (Sheffield), IMS
        Corpus Workbench (Stuttgart), Alembic Workbench (Mitre), MATE
        (Edinburgh/Odense/Stuttgart), Silfide (Loria/CNRS), SARA
        (BNC), and several others. Related to and in support of this
        development, there have also been efforts to develop standards
        for encoding and various kinds of linguistic annotation, as
        well as data architectures (e.g., TIPSTER, TalkBank). Still
        other developments, such as the introduction of XML
        and the powerful XSL transformation language and work on
        semi-structured data (e.g., the work of the Lore group at
        Stanford), have also impacted the ways in which corpora and
        other linguistic resources can be represented, stored, and

        Approaches to the fundamental design of formats, data, and
        tools vary among current systems for the annotation and
        exploitation of linguistic corpora. A primary reason for this
        diversity is that most developers are concerned with only one
        aspect of the creation/annotation/exploitation
        process. However, in order to work effectively toward
        commonality, the phases of the process must be considered as a
        whole. This demands bringing together researchers and
        developers from a variety of domains in text, speech, video,
        etc., many of whom have previously had little or no contact.

        This workshop is intended to bring these groups together to
        look broadly at the technical issues that bear on the
        development of software systems for the annotation and
        exploitation of linguistic resources. The goal is to lay the
        groundwork for the definition of a data and system
        architecture to support corpus annotation and exploitation
        that can be widely adopted within the community. Among the
        issues to be addressed are:

           o layered data architectures
           o system architectures for distributed databases
           o support for plurality of annotation schemes
           o impact and use of XML/XSL
           o support for multimedia, including speech and video
           o tools for creation, annotation, query and access of corpora
           o mechanisms for linkage of annotation and primary data
           o applicability of semi-structured data models, search and query
             systems, etc.
           o evaluation/validation of systems and annotations



Papers should be submitted in electronic form (preferably PostScript,
but plain ASCII, MS Word RTF, or HTML are acceptable) to
ide at by March 7, 2000. Please include the subject line: LREC WORKSHOP
SUBMISSION : <authors' last names> -- for example, "LREC WORKSHOP


       Nancy Ide (contact)
       Department of Computer Science
       Vassar College
       Poughkeepsie, New York 12604-0520 USA
       Tel : +1 914 437 5988
       Fax : +1 914 437 7498
       ide at

       Henry S. Thompson
       Human Communication Research Centre
       2 Buccleuch Place
       Edinburgh EH8 9LW
       Tel : +44 (131) 650 4440
       Fax : +44 (131) 650 4587
       ht at

Program Committee

       Steven Bird, Linguistic Data Consortium
       Patrice Bonhomme, LORIA/CNRS
       Roy Byrd, IBM Corporation
       Jean Carletta, HCRC Edinburgh
       Ulrich Heid, IMS Stuttgart
       Hamish Cunningham, Sheffield
       David Day, Mitre Corporation
       Robert Gaizauskas, Sheffield
       Ralph Grishman, New York University
       Nancy Ide, Vassar College (Chair)
       Masato Ishizaki, JAIST
       Dan Jurafsky, University of Colorado at Boulder
       Tony McEnery, Lancaster
       David McKelvie, HCRC Edinburgh
       Laurent Romary, LORIA/CNRS
       Gary Simons, Summer Institute of Linguistics
       Henry Thompson, HCRC Edinburgh
       Yorick Wilks, Sheffield
       Peter Wittenburg, Max Planck Institute
       Remi Zajac, New Mexico State University
