OLAC Protocol for Metadata Harvesting

Wed Jan 23 14:53:26 UTC 2002

Folks,

Back in December I reported on some discussions Gary and I had concerning
<olac-archive>, the OLAC Archive Description element (see section 3 of
http://www.language-archives.org/OLAC/protocol.html).  This element
contains archive-level metadata - the data that describes the archive as a
whole.

The issue concerned support for archive descriptions in multiple languages,
and the proposed solution was to add a lang attribute to the olac-archive
element.  Multiple instances of the element would then be given, one per
language, e.g.:

  <description>
    <olac-archive lang="en" type="institutional">
      ...
      <institution>National Archives of Canada</institution>
      ...
    </olac-archive>
    <olac-archive lang="fr" type="institutional">
      ...
      <institution>Archives nationales du Canada</institution>
      ...
    </olac-archive>
  </description>

However, this extra feature adds complexity to our software:
- the databases must now keep this archive-level metadata in a
  separate table (to permit arbitrary numbers of versions)
- there need to be best practices for keeping content consistent
  across the different language versions
- we need to find a way to distinguish official names from
  translations, as we already had to for alternative titles
  [http://www.language-archives.org/OLAC/olacms.html#Title]

A simpler approach is to permit a single <olac-archive> element,
and for OLAC implementers to specify archive-level metadata in
exactly the form they want it to be presented by service providers.
For example:

  <description>
    <olac-archive type="institutional">
      ...
      <institution>
        National Archives of Canada / Archives nationales du Canada
      </institution>
      ...
    </olac-archive>
  </description>

This is consistent with the approach taken elsewhere in the protocol
document, where we have said that element content should be given in the
form in which it should be presented by service providers.  For example:

> If more than one person has collaborated as personal sponsors of the
> archive, then this element should contain all the names in the order and
> format the collaborators want to be cited.

We could say something similar for multiple languages:

  "If the name of the sponsoring institution is standardly given in more than
  one language, then this element should contain all the names in the order
  and format required, e.g.  National Archives of Canada / Archives
  nationales du Canada"

In this way, we are drawing a sharp distinction between item-level and
archive-level metadata.  At the item level, multiple creators, titles,
languages, etc. are to be separated into distinct elements, e.g.:

  <title lang="x-sil-LLU">Na tala 'uria na idulaa diana</title>
  <title refine="alternative" lang="en">The road to good reading</title>

  <creator>Bloomfield, Leonard</creator>
  <creator>Haas, Mary</creator>

Service providers will make heavy use of this structure, both in indexing
materials, and in presenting them to end-users.

At the archive level, multiple creators, titles, languages, etc. are
collapsed into single elements (as we saw above), and service providers can
simply use these pre-formatted text strings to present end-users with
details of the harvested archives.  We propose to add paragraph markup
<p></p> for the elements (such as synopsis) that permit free-text content,
so that implementers can separate content in different languages.
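
As a rough sketch of how the paragraph markup might look (the placement of
synopsis inside olac-archive and the placeholder wording are assumptions
for illustration only, not text from the protocol document):

  <description>
    <olac-archive type="institutional">
      ...
      <synopsis>
        <p>(synopsis text in English)</p>
        <p>(synopsis text in French)</p>
      </synopsis>
      ...
    </olac-archive>
  </description>

A service provider would then display the paragraphs in the order given,
without needing to know which language each one is in.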

I hope this makes sense.  Any comments are welcome.
Thanks,
-Steven

--
Steven.Bird at ldc.upenn.edu  http://www.ldc.upenn.edu/sb
Assoc Director, LDC; Adj Assoc Prof, CIS & Linguistics
Linguistic Data Consortium, University of Pennsylvania
3615 Market St, Suite 200, Philadelphia, PA 19104-2608