A simpler format for OLAC vocabularies and schemes

Steven Bird sb at CS.MU.OZ.AU
Thu Oct 31 06:41:07 UTC 2002


About six weeks ago, Gary Simons and I presented a schematic outline
for a new representation for OLAC metadata.  We described a single
extension mechanism that would provide better interoperability and
extensibility, with less administrative and technical infrastructure
than before, with the goal of making it still easier for archives to
participate in OLAC.

About the same time we discovered very recent DCMI work on the XML
representation of DC and DC qualifiers:

  Guidelines for implementing Dublin Core in XML
  http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/

  Recommendations for XML Schema for Qualified Dublin Core
  http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/20021007/

These documents finally provide the DC XML framework that we had hoped
to find way back in January 2001, when we first started working on an
XML representation of our own Dublin Core qualifiers.

In the intervening six weeks we have figured out a new format for OLAC
metadata which implements our simplified extension mechanism, while
simultaneously re-using the new schemas from the DCMI.


REVIEW

To recap briefly, here are three examples showing OLAC 0.4 metadata,
the version in current use:

  <subject.language code="x-sil-BAN">Dschang</subject.language>
  <language scheme="AS-Formosan">Seediq</language>
  <contributor refine="editor">Sapir, Ned</contributor>

The examples illustrate several points:
(a) Element refinement, done in two different ways: a dotted element
    name (subject.language) and a refine attribute (refine="editor")
(b) OLAC encoding scheme: code="xxx"
(c) Free text element content, the escape hatch when OLAC codes don't fit
(d) A third party encoding scheme: scheme="xxx"

Here's the same information represented according to last month's
proposal for a simplified extension mechanism:

  <subject extension="OLAC-Language" code="x-sil-BAN">Dschang</subject>
  <language extension="AS-Formosan" code="Seediq"/>
  <contributor extension="OLAC-Role" code="editor">Sapir, Ned</contributor>

According to our proposal, this extension attribute would be used to
express all refinements, vocabularies and schemes, whether originating
from OLAC, an OLAC subcommunity, or an individual archive.  These
extensions wouldn't be centrally controlled, so individual archives
and groups of archives could develop their own extensions without any
community-wide approval process, and later demonstrate useful services
based on their extension in order to promote it to the community at
large.


REVISED REPRESENTATION

In the revised representation we are now proposing, the "extension"
attribute is renamed "xsi:type", and its value is given a namespace
prefix.  For example, the above three elements would be rewritten as
follows:

  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
  <language xsi:type="as:formosan" code="Seediq"/>
  <contributor xsi:type="olac:role" code="editor">Sapir, Ned</contributor>

This small change brings us into line with DCMI.  We no longer have to
define DC and DC qualifiers ourselves; we can simply import the DCMI
schemas directly.  This means that OLAC metadata is not only a semantic
extension of DC metadata, as in the past: the OLAC metadata *format* is
now also a *syntactic* extension of the DC metadata format.


THE FILES

The schemas are posted at:
http://www.language-archives.org/OLAC/1.0b1/

The contents of the directory are as follows:

1. Example metadata record
* olac.xml

2. Top level OLAC schema
* olac.xsd

3. OLAC vocabularies (subject to approval at the December workshop)
* olac-date.xsd
* olac-language.xsd
* olac-linguistic-field.xsd
* olac-linguistic-type.xsd
* olac-role.xsd

4. Hypothetical third-party extensions (to be hosted off-site)

a) Academia Sinica Formosan language vocabulary
* third-party/as-formosan.xml
* third-party/as-formosan.xsd

b) LT-World Human Language Technology vocabulary
* third-party/ltworld-hlt-field.xml
* third-party/ltworld-hlt-field.xsd

c) Individual archive's own redefined OLAC vocabularies
* third-party/myolac.xml
* third-party/myolac.xsd

d) Networking Data Centers' vocabulary (LDC/ELRA)
* third-party/netdc.xml
* third-party/netdc.xsd

e) Software vocabularies
* third-party/software.xml
* third-party/software-cpu.xsd
* third-party/software-os.xsd
* third-party/software-sourcecode.xsd
* third-party/software.xsd

f) An example mixing three independent extensions
* third-party/combined.xml


TECHNICAL DISCUSSION

(a) About xsi:type

The xsi:type attribute is defined in the XML Schema standard. It is a
directive to a schema validator, telling it to override the definition
of the XML element with the named type definition. It uses the
namespace declaration to find the schema fragment that defines the
overriding type.  Thus, the attribute xsi:type="olac:language" says:
"take the DC definition of subject, add an optional 'code' attribute,
and restrict the code values to the range specified in the schema for
olac:language."
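
As a rough illustration of the lookup described above, here is a small
Python sketch that resolves the prefix in each xsi:type value to the
namespace declared for it.  It is not a schema validator; it only shows
the prefix-to-namespace step.  The wrapper element <record> and the two
example.org namespace URIs are placeholders invented for this sketch,
not the real OLAC or Academia Sinica namespaces.

```python
import io
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# ASSUMPTION: <record> and the example.org URIs are placeholders.
RECORD = """<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:olac="http://example.org/olac-namespace"
        xmlns:as="http://example.org/as-namespace">
  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
  <language xsi:type="as:formosan" code="Seediq"/>
</record>"""


def resolve_types(xml_text):
    """Return (element, namespace URI, local type name) per xsi:type."""
    prefixes = {}  # simplification: ignores prefix scoping (no end-ns)
    resolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        else:
            qname = payload.get(XSI_TYPE)
            if qname is not None:
                prefix, _, local = qname.partition(":")
                resolved.append((payload.tag, prefixes.get(prefix), local))
    return resolved


for row in resolve_types(RECORD):
    print(row)
```

A real validator would go one step further and load the schema
associated with that namespace to find the named type.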

(b) Harvesting

When harvesting these records, OLAC service providers will store OLAC
and third-party metadata elements in the same way, using columns for
the extension name (i.e. the value of the xsi:type attribute), for the
code, and for the element content.  In this way, coded values and
element content will be searchable for both OLAC and third-party
vocabularies alike.  However, only OLAC vocabularies would have
special services associated with them (e.g. the language codes service
built into the LINGUIST service provider).  The proposer of a new
extension could set up their own service provider to demonstrate the
value of their vocabulary in resource discovery and promote it to the
whole OLAC community.
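
To make the storage idea concrete, here is a minimal Python sketch of
how a harvester might flatten each element into the three columns
mentioned above (extension, code, content), plus the element name.  The
<record> wrapper, the function name and the column names are our
assumptions for illustration; this is not the code of any existing
service provider.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# ASSUMPTION: <record> is a placeholder wrapper for this sketch.
RECORD = """<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
  <language xsi:type="as:formosan" code="Seediq"/>
  <contributor xsi:type="olac:role" code="editor">Sapir, Ned</contributor>
</record>"""


def harvest_rows(xml_text):
    """Flatten each metadata element into one uniform row."""
    rows = []
    for el in ET.fromstring(xml_text):
        rows.append({
            "element": el.tag,
            "extension": el.get(XSI_TYPE),  # None for plain DC elements
            "code": el.get("code"),         # None when no code is given
            "content": (el.text or "").strip(),
        })
    return rows


for row in harvest_rows(RECORD):
    print(row)
```

OLAC and third-party elements go through exactly the same code path
here, which is the point: nothing in the storage layer needs to know
who defined the vocabulary.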

(c) Dumb-down

Dumb-down from a third-party extension to OLAC, and dumb-down from
OLAC to DC, are straightforward to implement in this model.  Full
details will be circulated in a later message.

(d) Application profiles

An "application profile" is a hybrid metadata record that combines
elements and attributes that come from multiple authorities [1,2].
Under the newly proposed approach, we can conceive of OLAC metadata as
an application profile for the language resources community.
When a third party wants to extend the OLAC application profile, they
are actually creating a new application profile that combines DC and OLAC
metadata elements and attributes, along with their own.

[1] http://www.ariadne.ac.uk/issue25/app-profiles/
[2] http://dublincore.org/documents/library-application-profile/

(e) Copying the DCMI use of XML schemas

The decision to copy the DCMI's use of XML Schemas has two unfortunate
and unavoidable consequences.  First, the XML representation of DC and
OLAC metadata is tied to XML Schema validation.  If the validation
technology is ever changed, then the metadata format will need to be
changed.  Second, the xsi:type declarations are not constrained as to
which DC element they appear on.  If a metadata record used the role
vocabulary on an inappropriate element such as title, then the schema
validation would not report this error.
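
One way a harvester could compensate for the second problem is to run
its own check after schema validation.  The Python sketch below is only
an illustration under our own assumptions: the table of which elements
each vocabulary may appear on is inferred from the examples in this
message and is not an agreed OLAC rule, and the <record> wrapper and
the title text are made up.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# ASSUMPTION: illustrative table only, inferred from the examples in
# this message; not an official mapping.
ALLOWED_ELEMENTS = {
    "olac:role": {"contributor"},
    "olac:language": {"subject", "language"},
    "olac:linguistic-field": {"subject"},
}

# ASSUMPTION: made-up record with the role vocabulary on <title>.
RECORD = """<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <title xsi:type="olac:role" code="editor">A grammar</title>
  <contributor xsi:type="olac:role" code="editor">Sapir, Ned</contributor>
</record>"""


def misplaced_types(xml_text):
    """List (element, xsi:type) pairs where the type is on the wrong element."""
    problems = []
    for el in ET.fromstring(xml_text):
        ext = el.get(XSI_TYPE)
        allowed = ALLOWED_ELEMENTS.get(ext)
        # Unknown or absent extensions are skipped, not rejected.
        if allowed is not None and el.tag not in allowed:
            problems.append((el.tag, ext))
    return problems


print(misplaced_types(RECORD))
```

Third-party extensions not listed in the table pass through unchecked,
in keeping with the decentralised extension model.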

These are problems with the implementation decisions made by the
DC-Architecture Working Group, problems that we inherit.  We feel that
it is more important to conform to the DCMI and work with them to
address these issues than to continue working in isolation.

(f) Preserving a simple migration path

The new proposal maintains the simple migration path that is currently
permitted with OLAC 0.4.  This is an important feature for new
archives coming in to OLAC.  The following sequence illustrates the
migration path:

Step 1: archive maps their topic descriptor to the DC subject element:
  <subject>prosody</subject>

Step 2: archive uses the OLAC extension as a refinement, to state that
  the element content pertains to a linguistic field:
  <subject xsi:type="olac:linguistic-field">prosody</subject>

Step 3a: archive identifies the nearest OLAC code but retains
  their own data as a comment, to provide additional information:
  <subject xsi:type="olac:linguistic-field" code="phonology">prosody</subject>

OR
Step 3b: archive persuades community to accept a new vocabulary item:
  <subject xsi:type="olac:linguistic-field" code="prosody"/>

Note that step 3a illustrates an escape hatch for archives that have a
problem mapping their descriptors to OLAC vocabulary items.

Note also that this approach represents a minor deviation from the
DCMI approach, which puts coded values in the element content, leaving
no room for comments.


CONCLUSION

The revised proposal differs minimally from the previous proposal: the
"extension" attribute is renamed "xsi:type", and its value is given a
namespace prefix.

We believe this proposal represents a significant improvement on the
current OLAC 0.4 format in the areas of simplicity, interoperability
and extensibility.  Furthermore, it puts us squarely in the DC
community: OLAC won't have to reimplement each new DC Qualifier that
the DCMI adopts; OLAC can benefit from any software that works on DC
metadata; and OLAC vocabularies can be easily adopted outside the OLAC
community.

With your approval, we will document this new format and bring it up
at the December workshop as the proposal for OLAC version 1.0.  Once
adopted, each OLAC archive would be required to support it in order to
participate in OLAC.

Please send any comments to the list.

Steven Bird & Gary Simons


