12.327, FYI: Workshop: Web-Based Lang Documentation & OLAC

The LINGUIST Network linguist at linguistlist.org
Thu Feb 8 23:03:53 UTC 2001


LINGUIST List:  Vol-12-327. Thu Feb 8 2001. ISSN: 1068-4875.

Subject: 12.327, FYI: Workshop: Web-Based Lang Documentation & OLAC

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
            Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Editors (linguist at linguistlist.org):
	Karen Milligan, WSU 		Naomi Ogasawara, EMU
	Lydia Grebenyova, EMU		Jody Huellmantel, WSU
	James Yuells, WSU		Michael Appleby, EMU
	Marie Klopfenstein, WSU		Ljuba Veselinova, Stockholm U.

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Lydia Grebenyova <lydia at linguistlist.org>

=================================Directory=================================

1)
Date:  Thu, 01 Feb 2001 12:57:25 EST
From:  "J. Albert Bickford" <albert_bickford at sil.org>,
       Steven Bird <sb at ldc.upenn.edu>
Subject:  Workshop Report: Web-Based Lang Documentation & Description, & OLAC

-------------------------------- Message 1 -------------------------------

Date:  Thu, 01 Feb 2001 12:57:25 EST
From:  "J. Albert Bickford" <albert_bickford at sil.org>,
       Steven Bird <sb at ldc.upenn.edu>
Subject:  Workshop Report: Web-Based Lang Documentation & Description, & OLAC

Workshop Report

Workshop on Web-Based Language Documentation and Description,
and the Open Language Archives Community

J. Albert Bickford
SIL-Mexico and University of North Dakota
albert_bickford at sil.org


The Workshop on Web-Based Language Documentation and Description (December
12-15, 2000, University of Pennsylvania) brought together linguists,
archivists, software developers, publishers and funding agencies to discuss
how best to publish information about language on the internet. This
workshop, together with the Open Language Archives Community that is
developing out of it, seems especially important in providing useful
information about linguistics and less commonly studied languages for both
scholars and the wide general audience that can be found on the web. I hope
that this report will be useful in understanding these new developments in
the linguistics publishing and archiving field.

The aim of the workshop was to establish an infrastructure for electronic
publishing that simultaneously addresses the needs of users (including
scholars, language communities, and the general public), creators,
archivists, software developers, and funding agencies. Such an
infrastructure would ideally meet a number of requirements important to
these different stakeholders, such as:

* provide a single entry point on the internet through which all materials
  can be easily located, regardless of where they are stored (on the
  internet or in a traditional archive). Essentially, this would be a
  massive union catalog of the whole internet and beyond.

* identify every language uniquely and precisely, so that all materials
  relevant to a particular language can be located

* make available software for creating, using, and archiving data
  (especially data in special formats); this includes software to help
  convert data from older formats to newer ones

* serve as a forum for giving and receiving advice about software, archiving
  practices, and related matters

* provide opportunity for comments and reviews of materials published within
  the system

The workshop was organized by Steven Bird (University of Pennsylvania) and
Gary Simons (SIL International).[1]  It included approximately 40
presentations and several working sessions on a variety of topics.

There was general agreement among the participants that a system for
organizing the wealth of language-related material on the internet is
needed, and that an appropriate way to establish one is by following the
guidelines of the Open Archives Initiative (OAI)
[http://www.openarchives.org/]. (These guidelines provide a general
framework for creating systems like this for specific scholarly
communities.) An OAI publishing and archiving system contains the following
elements:

* data providers, which house the materials that are indexed in the system

* a standardized set of cataloguing information for describing each of the
  materials, also known as "metadata" (i.e., data about data)

* service providers, which collect the metadata from all the data providers
  and allow users to search it in various ways so as to locate materials of
  interest to them
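The flow between these elements can be sketched concretely. In the OAI framework, a service provider harvests metadata records (typically Dublin Core) from each data provider over plain HTTP. The fragment below parses a fabricated harvesting response; the repository, record, title, and subject code are all invented for illustration:

```python
import xml.etree.ElementTree as ET

# A fabricated OAI-PMH ListRecords response carrying Dublin Core metadata.
# A real harvester would fetch XML like this over HTTP from a data
# provider's base URL (e.g. ...?verb=ListRecords&metadataPrefix=oai_dc).
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sketch Grammar of Example Language</dc:title>
          <dc:subject>example-language-code</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def harvest_titles(response_xml):
    """Collect the dc:title of every record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [t.text for t in root.findall(".//dc:title", NS)]

print(harvest_titles(SAMPLE_RESPONSE))
```

A service provider would run such a harvest against every registered data provider and merge the records into one searchable union catalog.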

In the case of linguistics, the system will be known as the Open Language
Archives Community (OLAC). The LINGUIST List [http://www.linguistlist.org/]
has agreed to serve as the system's primary service provider. It will be
the main source that people will use to find materials through the
system. Further information about OLAC can be found at
[http://www.language-archives.org/]. The agreement to establish OLAC is
probably the most important accomplishment of the workshop.

This agreement was solidified through working sessions that met during the
workshop and began working out the details in various areas, such as:

* Character encoding: Unicode, fonts, character sets, etc.

* Data structure for different types of data (lexicons, annotated text, etc.)

* Metadata (cataloguing information that should be common to the whole
  community and how it should be represented in the computer) and other
  concerns of archivists

* Ethics, especially the responsibilities that archivists and publishers
  have to language communities

* Expectations of users, creators (e.g. authors), software developers

These and other issues will continue to be discussed on email lists in the
coming months, culminating in recommendations for "best practice" in each
area, together with a preliminary launch of the whole system, hopefully
within a year. (Prototypes of the system are available now at the OLAC address
above, along with various planning documents.)

There were also a number of conference papers, which provided a foundation for
making the working sessions productive. Rather than review each presentation
here, I will simply summarize the topics covered, since the papers themselves
are all available on the conference website
[http://www.ldc.upenn.edu/exploration/expl2000/]. The topics
covered included the following:

* Proposals for various aspects of the OLAC system

* Concerns of various stakeholders, such as archivists, sponsors, language
  communities

* Descriptions and demonstrations of specific software, research projects,
  and web publishing systems

* Metadata and metadata standards

* Technical issues, such as Unicode, the OAI, sorting, data formats for
  different types of language materials (e.g. dictionaries, annotated text,
  example sentences in linguistic papers, and audio)

One insight that I gleaned from these presentations was a better understanding
of glossed interlinear text. Interlinear text is not a type of data, but rather
just one possible way of displaying an annotated text. The annotations on a
text can consist of many types of information: alternate transcriptions,
morpheme glosses, word glosses, free translations, syntactic structure (and
possibly several alternative tree structures for the same text), discourse
structure, audio and video recordings, footnotes and commentary on various
issues, etc. What ties them all together is a "timeline" that proceeds from the
beginning to the end of a text, to which different types of information are
anchored. Aligned interlinear glosses are one way of displaying some of this
information, but not the only way, and not even the most appropriate way for
some types of information. The traditional arrangement of Talmudic material,
for example, with the core text in the center of the page and commentary around
the edges, is another possible display of annotated text, in which the
annotations are associated more with whole sentences and paragraphs than with
individual morphemes. There are also some sophisticated examples available for
presenting audio alongside interlinear text (see, for example, the LACITO
archive [http://lacito.archivage.vjf.cnrs.fr/]).
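This view of annotation can be made concrete with a small sketch. The data model below is an assumed, simplified illustration (not any standard format): each annotation anchors one kind of information to a span of the base timeline, and interlinear glossing is merely one way of rendering some of the tiers. The language data itself is invented.

```python
# A sketch of timeline-anchored annotation: each tuple ties one tier of
# information to a span of the base text, identified by offsets.
text = "nikanwi"  # invented base transcription
annotations = [
    # (start, end, tier, value)
    (0, 2, "morpheme", "ni"),
    (0, 2, "gloss", "1SG"),
    (2, 5, "morpheme", "kan"),
    (2, 5, "gloss", "see"),
    (5, 7, "morpheme", "wi"),
    (5, 7, "gloss", "PAST"),
    (0, 7, "translation", "I saw it"),
]

def render_interlinear(annotations):
    """One rendering among many: morpheme and gloss tiers in aligned
    columns, followed by the free translation."""
    spans = sorted({(s, e) for s, e, t, _ in annotations
                    if t in ("morpheme", "gloss")})
    cells = {t: {(s, e): v for s, e, tt, v in annotations if tt == t}
             for t in ("morpheme", "gloss")}
    widths = [max(len(cells[t].get(sp, "")) for t in cells) for sp in spans]
    lines = ["  ".join(cells[t].get(sp, "").ljust(w)
                       for sp, w in zip(spans, widths)).rstrip()
             for t in ("morpheme", "gloss")]
    lines += [v for _, _, t, v in annotations if t == "translation"]
    return "\n".join(lines)

print(render_interlinear(annotations))
```

A Talmud-style page layout, or audio playback keyed to the same spans, would simply be other rendering functions over the same set of annotations.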

Throughout, it was very clear that those at the conference had a great deal in
common with each other:

* a primary concern for descriptive (as distinguished from theoretical) linguistics

* a desire to make language materials available, to communities of speakers
  and the general public as well as scholars

* an interest in taking advantage of the Internet, which provides a means
  of publishing such materials that bypasses the limitations of
  traditional publication (since the costs are much lower, and thus
  appropriate for materials that have smaller audiences)

* awareness that many materials may be less than fully polished yet still
  valuable to some people and worth archiving

* a sense of frustration with the currently confused state of the art in
  data formats, especially fonts and character encoding, and the lack of
  good information about how best to archive and publish on the web

* awareness of the large amount of data that is in data formats which will
  be obsolete in a few years (and thus a willingness to accept data in
  whatever form it is in, while also seeing a need for software to help
  convert data to newer formats)

* a deep suspicion of rigid requirements, yet a
  willingness to adopt standards voluntarily when their usefulness has been
  demonstrated


Finally, the conference pointed out several trends that will be increasingly
important in future years.

* The speakers of lesser-known languages will be more actively involved in
  the production and use of materials in and about their languages, and their
  concerns will increasingly have to be considered by scholars. These
  include carefully documenting permissions and levels of access to
  materials, making sure that language materials are available to the
  communities themselves, and being careful that scholars do not
  inadvertently aid commercial interests in exploiting native
  knowledge-systems (such as medicinal use of plants) without appropriate
  compensation.

* The boundary between publishing, libraries, and archiving is being
  blurred by the shift to the digital world. Materials can be "archived" on
  the web, which is a type of publication. Electronic "libraries" are
  springing up in many places. Published and unpublished works from around
  the world can be listed together in one common catalog. The same
  technology is important in both spheres of activity. In short, these
  activities are merging under a new umbrella that could be called
  "scholarly information management". A corollary to this trend is that
  archiving is not just something done at the end of a research project;
  it's part of the ongoing process of managing the information that the
  project produces.

* In such a world, and with huge numbers of resources available to sift
  through, metadata becomes increasingly important. A freeform paragraph
  description in a publications catalog is no longer good enough. It is the
  metadata that users will consult in order to find materials of interest
  to them, so the metadata must be carefully structured, accurate and
  current. More and more, scholars will have to think not just about
  producing materials but also about how to describe them so as to make
  them accessible to others.

* Unicode [http://www.unicode.org/] is the way of the future for
  representation of special characters in computers. The days of special
  fonts for each language project are numbered. Instead, Unicode will make
  possible a single set of fonts that meets virtually everyone's needs in
  the same package. Over the next few years, most people will be switching
  their computers over to using Unicode almost exclusively (that is, if
  they want to take advantage of newer software).

* Language data will increasingly need to be structured carefully so that
  not only people but also machines can understand and manipulate it in
  various ways. This will most likely be done using XML (Extensible Markup
  Language), which is already widely supported in the computer industry,
  with more support becoming available regularly.[2]
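As a small illustration of this last point, consider a lexical entry marked up in XML. The markup below is invented purely for illustration (OLAC does not prescribe any particular schema); the point is that once the parts of an entry are explicitly tagged, the same data a person reads in a dictionary becomes queryable by a program.

```python
import xml.etree.ElementTree as ET

# An invented XML lexicon fragment (illustrative only); the element
# names and the language data are made up.
ENTRY_XML = """\
<lexicon>
  <entry>
    <headword>kan</headword>
    <pos>verb</pos>
    <gloss>see</gloss>
  </entry>
  <entry>
    <headword>ni</headword>
    <pos>pronoun</pos>
    <gloss>I</gloss>
  </entry>
</lexicon>
"""

root = ET.fromstring(ENTRY_XML)
# A program can now answer structured queries, e.g. "list every verb":
verbs = [e.findtext("headword") for e in root.findall("entry")
         if e.findtext("pos") == "verb"]
print(verbs)  # ['kan']
```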

All in all, it was a workshop that was both stimulating and practical, one
which will have an unusual amount of influence in months and years to come.


Footnotes:

[1] Funding was provided by the Institute for Research in Cognitive Science
    (IRCS) of the University of Pennsylvania, the International Standards in
    Language Engineering Spoken Language Group (ISLE), and Talkbank.

[2] Since XML's development has been closely associated with the World Wide
    Web Consortium [http://www.w3.org/XML/], it has been widely regarded as
    the successor to HTML for web pages. However, this is just a small part
    of its usefulness; it is a general-purpose system for representing the
    structure of information in a document or database, which can be
    customized for myriad purposes. Many software tools are currently
    available for creating and manipulating data in XML, with more being
    created all the time. One of these, XSLT (Extensible Stylesheet Language
    Transformations) [http://www.w3.org/TR/xslt], can perform complex
    restructuring of XML data.

(This report will be published in Notes on Linguistics,
http://www.sil.org/linguistics/NOL.htm)

---------------------------------------------------------------------------
LINGUIST List: Vol-12-327
