A simpler format for OLAC vocabularies and schemes

Helen Dry hdry at LINGUISTLIST.ORG
Thu Sep 26 23:15:53 UTC 2002


Hi, Steven (and everyone)

Sorry to be so late responding to this proposal, but it's been a busy month.

I am a little concerned about this proposal, perhaps because I don't understand
exactly how the scheme system would work, so I thought I should make my
comments and ask a few questions.  Apologies if either or both are at a rather
elementary level--I only seem to understand DC and XML for 10 minutes, right after
I reread the websites.  :-)

It seems to me that there are two separable proposals here:  (1) collapsing the
formal mechanisms of refinement and scheme into the extension mechanism and
(2) abandoning the attempt to reach general consensus on the descriptors that
previously we were calling controlled vocabularies. The first may well be a welcome
simplification, particularly administratively. (And I seem to have heard that it's the
way the DC is going anyway.)  The second seems worrisome to me for two primary
reasons:  (1) it seems counter to the overarching OLAC (and EMELD) goal of a
unified--dare we say "standardized"?--mechanism for resource description and
retrieval within the discipline; (2) on a practical level it may complicate--perhaps to
a debilitating degree--the way that  service providers implement search facilities.

Of course, I'm thinking about LINGUIST here--we aren't an archive, so the potential
benefits of being able to DESCRIBE resources via any scheme we might devise are
not salient to me.  What I'm worried about is how we're going to offer a search
engine that makes use of all these variant descriptions.  Particularly for something
like linguistic data types--which is probably the main search field linguists will want
to use--this seems almost like a return to the bad old days of the free text field, with
the consequent loss of ability to identify and retrieve relevant resources.

Now I imagine that there is some formal mechanism for relating schemes--I know
you have a paragraph below about archives putting the schemes they use in their
identifiers.   But could you tell me exactly how this would work in practice?  E.g., at
the level of elements or terms?  Would an archive that wants to use its own scheme
have to provide a document showing how its categories relate to the categories in
all the other schemes (e.g., that its "Seediq" is SIL's "Taroko")?  Would the
service provider have to construct a search engine that would first find and correlate
all these documents, then search the multi-archive metadata for the resulting sets of
terms?  I'm sure it's possible--IF you could get everyone to provide scheme
mappings--but it certainly seems unnecessarily complex. . . and, as I said, counter
to the purpose of OLAC.  I thought we were trying to settle on a unified way to
describe linguistic resources, in order to offer the discipline the benefits of a level of
standardization.  Though this will come at the admitted expense of a certain amount
of detail and precision, I feel confident that it will be accepted (accepted for what it
is) if we persevere. After all, DC isn't perfect but people understand the utility of a
restricted set of elements.
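To make the worry concrete, here is a minimal sketch of the query expansion a service provider would have to perform, assuming every archive really did publish a mapping from its local scheme to a reference vocabulary.  The scheme names and mapping table below are illustrative only; no such mapping documents exist yet.

```python
# Hypothetical mappings a service provider would have to collect from
# each archive, relating local codes to a reference vocabulary (here,
# SIL Ethnologue codes).  Entirely illustrative data.
SCHEME_MAPPINGS = {
    "AS-Formosan": {"Seediq": "x-sil-TRV"},
    # Identity mapping: these are already reference codes.
    "OLAC-Language": {"x-sil-TRV": "x-sil-TRV"},
}

def expand_query(term):
    """Return every (scheme, code) pair equivalent to the given term."""
    # First resolve the term to a reference code via any known scheme.
    reference = None
    for mapping in SCHEME_MAPPINGS.values():
        if term in mapping:
            reference = mapping[term]
            break
    if reference is None:
        # No scheme knows the term: fall back to a free-text search.
        return {("free-text", term)}
    # Then fan out to every scheme's local name for that reference code.
    results = set()
    for scheme, mapping in SCHEME_MAPPINGS.items():
        for local, ref in mapping.items():
            if ref == reference:
                results.add((scheme, local))
    return results
```

Even this toy version shows the cost: correct retrieval depends on every archive supplying a complete mapping, and any gap silently degrades a scheme search into free-text matching.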

It seems to me that, if the problem is that we may not come up with a proposal
before December, we should either redouble our efforts and make the deadline or
extend the deadline--not scrap the enterprise.  Actually, with regard to linguistic data
types, I feel confident we can come up with a reasonable proposal before the
deadline.  And I think it's important that we do so, since this is really one of the most
important vocabularies--probably the most important for a large part of our
audience, i.e. academic linguists.  It's the main way that people, as opposed to
machines, will want to search the archives.

So, in sum, I agree with the arguments for using the extension mechanism and
abandoning refinement and scheme.  But I don't see the need to abandon the goal
of reaching consensus on a single "OLAC-approved" set of linguistic data types,
however that would be modeled in a world of "extensions" (not controlled
vocabularies).  Can we use extensions without letting the whole world in?

BTW, under the proposal, will all the current refinements--e.g.,
"subject.language"--now become schemes?

But now I should stop and let someone knowledgeable explain to me exactly how this
scheme system will work.  I'm all ears . . . .   :-)

Ready for enlightenment ....

-Helen






On 16 Sep 2002 at 17:39, Steven Bird wrote:

The OLAC metadata format provides two mechanisms for community-
specific resource description.  First, special refinements (metadata
elements and corresponding vocabularies) support compatible
description across the community.  For example, the subject.language
element, and the OLAC-Language vocabulary, permit all archives to
identify subject language in the same manner.  Second, every OLAC
element permits an optional scheme attribute for use by
sub-communities of OLAC.  For example, the scholars at Academia Sinica
can use their own naming scheme for Formosan languages and still
package it up using the OLAC metadata container.  This combination of
standard refinements and user-defined schemes seems to offer a
reasonable balance between interoperability and extensibility.

Over the past month, Gary and I have been reviewing the design of OLAC
metadata and have concluded that these parallel mechanisms are
unnecessary.  We think that with a *single* extension mechanism, OLAC
can provide even better interoperability and extensibility.  Moreover,
we think this can be done with less administrative and technical
infrastructure than before, making it still easier for archives to
participate in OLAC.


A. THE PRESENT SITUATION

We begin with a quick review of how the two existing mechanisms work
in OLAC metadata.  First, community-specific refinements are
expressed as Dublin Core qualifications represented in XML.  Here
is an example for subject language:

  A resource about the Sikaiana language:
  <subject.language code="x-sil-SKY"/>

This refinement permits focussed searching and better precision/recall
than the corresponding Dublin Core element:

  <subject>The Sikaiana Language</subject>

The OLAC version is flexible in that the code attribute is optional
and free text can be put in the element content.

The second mechanism is for user-defined schemes.  All OLAC elements
permit a scheme attribute, naming some third-party format or
vocabulary that one or more OLAC archives use.  For instance, the
language listed by Ethnologue as Taroko (TRV) is known as Seediq in
Academia Sinica, and OLAC would permit either or both of the following
elements to appear in a metadata record for this language:

  <subject.language code="x-sil-TRV"/>
  <subject.language scheme="AS-Formosan">Seediq</subject.language>

Such a resource would be discovered under either naming scheme, and
Academia Sinica could provide end-user services that rewarded any archive
which employed its scheme for Formosan language identification.


B. PROBLEMS WITH THE PRESENT SITUATION

There are four general problems with the present situation.

1. Finalizing standard refinements.  Our track record at developing
controlled vocabularies over the past year indicates that we are not
going to be able to finalize all the vocabularies that the OLAC
metadata standard specifies in time for launching version 1.0 after
our December workshop.  Even if some vocabularies are finalized by
December, the discussion may be reopened any time a new kind of
archive joins OLAC.  However, each vocabulary revision must currently
be released as a new version of the entire OLAC metadata set, an
unacceptable bureaucratic obstacle.

2. The artificial distinction between refinements and schemes.  It is
not clear when a putative refinement is important enough to be adopted
as an OLAC standard, versus a user-defined scheme.  Some of the
refinements we recognize at present aren't as germane to the overall
enterprise as others (e.g. operating system vs subject language), and
may not have enough support to be retained.  Conversely, the community
is sure to develop new, useful ontologies that we don't support at
present, and we would need to change the OLAC metadata standard in
order to accommodate them.  Promoting a user-defined scheme to an OLAC
standard would necessitate a change in the XML representation, generating
unnecessary work for all archives that support the scheme.

3. Duplication of technical support.  User-defined schemes are likely
to involve controlled vocabularies, with the same needs as OLAC
vocabularies with respect to validation, translation to
human-readable form in service providers, and dumb-down to Dublin Core
for OAI interoperability.  At present, the necessary infrastructure
must be created twice over, once for each of the two mechanisms.
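As one example of the shared infrastructure, here is a minimal sketch of dumb-down to unqualified Dublin Core, based on the element syntax shown in section A.  The function name and fallback rules are assumptions, not OLAC-specified behaviour; under the current design, something like it would be needed once for refinements and again for schemes.

```python
import xml.etree.ElementTree as ET

def dumb_down(olac_xml):
    """Map a refined OLAC element to its plain Dublin Core equivalent."""
    elem = ET.fromstring(olac_xml)
    # "subject.language" dumbs down to the base element "subject".
    base = elem.tag.split(".")[0]
    # Prefer free-text content; fall back to the machine-readable code.
    text = (elem.text or "").strip() or elem.get("code", "")
    return "<%s>%s</%s>" % (base, text, base)
```

So `<subject.language code="x-sil-SKY"/>` would dumb down to `<subject>x-sil-SKY</subject>`, and a scheme-bearing element with free-text content would keep its text.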

4. Idiosyncrasies of XML schema.  XML schema is used to define the
well-formedness of OLAC records, but it is unable to express
co-occurrence constraints between attribute values.  This means that
we cannot have more than one vocabulary for an element, forcing us to
build structure into element names and multiply the names
(e.g. Format.markup, Format.cpu, Format.os, ...).  It is unfortunate
that such a fundamental aspect of the OLAC XML format depends on a
shortcoming of a tool that we may not be using for very long.

In sum, the current model will be difficult to manage over the long
term.  Administratively, it encourages us to seek premature closure on
issues of content description that can never be closed.  Technically,
it forces us to release new versions of the metadata format with each
vocabulary revision, and forces us to create software infrastructure to
support a mishmash of four syntactic extensions of DC:

   <element.EXT1 refine="EXT2" code="EXT3" scheme="EXT4">


C. A NEW APPROACH

In response to the problems outlined above, we would like to propose a
new approach.  The basic idea is simple: express all refinements,
vocabularies and schemes using a uniform DC extension mechanism, and
treat them all as recommendations instead of centrally-validated
standards.  The extension mechanism requires two attributes, called
"extension" and "code", as shown below:

  <subject extension="OLAC-Language" code="x-sil-SKY"/>
  <subject extension="AS-Formosan" code="Seediq"/>

It would be syntactically valid to simply use an extension in metadata
without defining it. However, for extensions that will be used across
the community, there must also be a formal definition that enumerates
the corresponding controlled vocabulary in such a way that data
providers and service providers alike can harvest the vocabulary from
its definitive source. Thus another aspect of the new approach is an
XML schema for the formal definition of an XDC extension. In the
description section of the OAI Identify response, a data provider
would declare which formally defined extensions it employs in its metadata.
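The validation side of this can be sketched simply.  Assuming the formally defined vocabularies have already been harvested from their definitive sources (the dict below stands in for that harvest), a service provider could check each extension-bearing element in a record; the function name and data shapes are illustrative assumptions, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Stand-in for vocabularies harvested from each extension's formal
# definition; the contents here are illustrative.
HARVESTED_VOCABULARIES = {
    "OLAC-Language": {"x-sil-SKY", "x-sil-TRV"},
    "AS-Formosan": {"Seediq"},
}

def validate_record(record_xml):
    """Yield (extension, code, ok) for each extension-bearing element."""
    root = ET.fromstring(record_xml)
    for elem in root.iter():
        ext = elem.get("extension")
        if ext is None:
            continue
        code = elem.get("code", "")
        vocab = HARVESTED_VOCABULARIES.get(ext)
        # An undefined extension is syntactically valid but unverifiable,
        # so it passes; a defined one must use a code from its vocabulary.
        ok = vocab is None or code in vocab
        yield ext, code, ok
```

Because the check keys off a single pair of attributes, the same routine covers what are today refinements and what are today schemes.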

Extensions that enjoyed broad community support would be identified as
OLAC Recommendations (following the existing OLAC Process).  All OLAC
archives would be encouraged to adopt them, in the sense that OLAC
service providers would permit end-users to perform focussed searches
over these extensions.  In this way, archives that cooperate with the
rest of the community are rewarded.

Note that the approach isn't specific to language archives, so we're
calling it extensible Dublin Core (XDC).  An example of the syntax
is available (an XML DTD, the equivalent XML schema, and an instance
document): http://www.language-archives.org/XDC/0.1/


D. BENEFITS

The new approach is technically simpler than the existing approach,
and neatly solves the four problems we reported.

1. Finalizing standard refinements.  The editors of OLAC vocabulary
documents would be empowered to edit the vocabulary into the future,
without concern for integration with new releases of the OLAC metadata
format.

2. The artificial distinction between refinements and schemes.  The
syntactic distinction is gone, being replaced by a semantic one: is
the vocabulary an OLAC Recommendation or not?  Any archive or group of
archives would be free to start using their own extensions without any
formal registration.  They could build a service to demonstrate the
merit of their extension, thereby encouraging other archives to adopt
it.  Once broad support had been established, they could build a case
for an OLAC Recommendation, leading to adoption across the community.

3. Duplication of technical support.  With the single extension
mechanism, we can provide uniform technical support for validation,
translation and dumb-down.

4. Idiosyncrasies of XML schema.  We no longer give XML schema such
sway in determining our XML syntax.  Other XML and database technologies
will be used to test that an extension is used correctly.

In sum, the new approach is extensible, requiring no central
administration of extensions, and no coordination of vocabulary
revisions with new releases of the metadata format.  The new approach
also supports interoperability across the whole OLAC community (via
OLAC Recommendations) and also among OLAC sub-communities that want to
create their own special-purpose extensions.


E. IMPLICATIONS

We are still working out the technical implications for OLAC central
services (e.g. registration, Vida, ORE, etc), and we will only be able
to implement parts of this in time for the December meeting.  As
always, we would welcome donations of programmer time to help us.

The short-term implication for OLAC archives is completely trivial,
since only a simple syntactic change is required.

The most important implication of this change is that it reduces the
pressure to reach final agreement on OLAC vocabularies by our December
workshop.  But this isn't an excuse for us to slow down on that front.
On the contrary, it frees us up to find working solutions for the key
vocabularies that define us as a community.  These will always be
imperfect compromises that we can agree to work with and revise as
necessary, well into the future.

In sum, we hope we are not opening up a technical can of worms, but
facilitating progress on the substantive issues, our common
descriptive ontologies.  Therefore, we encourage people to identify a
particular extension that they would like to work on, and post their
ideas and questions to this list (as Baden Hughes has just now done
for sourcecode).  You may also like to present your ideas at our
workshop in December...

--

So, what do you think?  Do you agree with our proposals for
(i) a syntactic simplification in our XML representation, and
(ii) switching OLAC vocabularies from being centrally validated
standards to recommendations?  We would welcome your feedback.

Steven Bird & Gary Simons


