From baden at COMPULING.NET  Mon Sep 16 13:15:50 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 16 Sep 2002 23:15:50 +1000
Subject: query about format.sourcecode
Message-ID: <MON.16.SEP.2002.231550.1000.>

Hi -

I've got a query about matters related to the element format.sourcecode

Currently the spec at http://www.language-archives.org/OLAC/olacms.html
assumes that software resources indexed by OLAC will be in source code
(and hence appropriate entries will be made under this tagset).

The recommendation is currently:

<format.sourcecode
code="PROGRAMMING_LANGUAGE">Comments</format.sourcecode>

There are several questions I have about this.

1) Do we need to clarify this even further as there are apparently two
distinct options from the archive contents I've been working with). One
is where the sourcecode requires compilation, the other is where
sourcecode is essentially a script (or series of scripts). Any
information about the "state" of the source code is likely to be
inconsistent at best across archives, and I suspect even within a single
archive. IMHO its relatively important to the end user of the OLAC
search engine as to what state the sourcecode is in (ie how applicable
is this code to the platforms I have access to).

2) In the case where software resources indexed by OLAC are distributed
in compiled form (ie not sourcecode) there's apparently not much more
room to encode this information either. Apart from not strictly being
something which belongs in a format.sourcecode element, the
recommendation I assume would be that you could standardise this again
by using the comment field, but the same consistency problem arises.
Again, IMHO its relatively important to the end user of the OLAC search
engine as to what state the sourcecode is in (ie can I just install and
run or is it more complex)

These two points may not represent large issues, but if the archives you
are dealing with have a lot of software which ranges from source scripts
in a range of languages, source for compilation for a range of
compilers, and compiled "ready to run" applications, the granularity of
this markup can be important and greatly assist with classification and
indexation of resources in an appropriate manner. Additionally, for the
less computer literate end users, this distinction is very important in
them effectively locating a resource which is appropriate to their
needs.

Baden


From sb at UNAGI.CIS.UPENN.EDU  Mon Sep 16 21:39:54 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 16 Sep 2002 17:39:54 EDT
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <MON.16.SEP.2002.173954.EDT.SB@LDC.UPENN.EDU>

The OLAC metadata format provides two mechanisms for community-
specific resource description.  First, special refinements (metadata
elements and corresponding vocabularies) support compatible
description across the community.  For example, the subject.language
element, and the OLAC-Language vocabulary, permit all archives to
identify subject language in the same manner.  Second, every OLAC
element permits an optional scheme attribute for use by
sub-communities of OLAC.  For example, the scholars at Academia Sinica
can use their own naming scheme for Formosan languages and still
package it up using the OLAC metadata container.  This combination of
standard refinements and user-defined schemes seems to offer a
reasonable balance between interoperability and extensibility.

Over the past month, Gary and I have been reviewing the design of OLAC
metadata and have concluded that these parallel mechanisms are
unnecessary.  We think that with a *single* extension mechanism, OLAC
can provide even better interoperability and extensibility.  Moreover,
we think this can be done with less administrative and technical
infrastructure than before, making it still easier for archives to
participate in OLAC.


A. THE PRESENT SITUATION

We begin with a quick review of how the two existing mechanisms work
in OLAC metadata.  First, community-specific refinements are
represented using Dublin Core qualifications represented in XML.  Here
is an example for subject language:

  A resource about the Sikaiana language:
  <subject.language code="x-sil-SKY"/>

This refinement permits focussed searching and better precision/recall
than the corresponding Dublin Core element:

  <subject>The Sikaiana Language</subject>

The OLAC version is flexible in that the code attribute is optional
and that free-text can be put in the element content.

The second mechanism is for user-defined schemes.  All OLAC elements
permit a scheme attribute, naming some third-party format or
vocabulary that one or more OLAC archives use.  For instance, the
language listed by Ethnologue as Taroko (TRV) is known as Seediq in
Academia Sinica, and OLAC would permit either or both of the following
elements to appear in a metadata record for this language:

  <subject.language code="x-sil-TRV"/>
  <subject.language scheme="AS-Formosan">Seediq</subject.language>

Such a resource would be discovered under either naming scheme, and
Academia Sinica could provide end-user services that rewarded any archive
which employed its scheme for Formosan language identification.


B. PROBLEMS WITH THE PRESENT SITUATION

There are four general problems with the present situation.

1. Finalizing standard refinements.  Our track record at developing
controlled vocabularies over the past year indicates that we are not
going to be able to finalize all the vocabularies that the OLAC
metadata standard specifies in time for launching version 1.0 after
our December workshop.  Even if some vocabularies are finalized by
December, the discussion may be reopened any time a new kind of
archive joins OLAC.  However, each vocabulary revision must currently
be released as a new version of the entire OLAC metadata set, an
unacceptable bureaucratic obstacle.

2. The artificial distinction between refinements and schemes.  It is
not clear when a putative refinement is important enough to be adopted
as an OLAC standard, versus a user-defined scheme.  Some of the
refinements we recognize at present aren't as germane to the overall
enterprise as others (e.g. operating system vs subject language), and
may not have enough support to be retained.  Conversely, the community
is sure to develop new, useful ontologies that we don't support at
present, and we would need to change the OLAC metadata standard in
order to accommodate them.  Promoting a user-defined scheme to an OLAC
standard would necessitate a change in the XML representation, generating
unnecessary work for all archives that support the scheme.

3. Duplication of technical support.  User-defined schemes are likely
to involve controlled vocabularies, with the same needs as OLAC
vocabularies with respect to validation, translation to
human-readable form in service providers, and dumb-down to Dublin Core
for OAI interoperability.  At present, the necessary infrastructure
must be created twice over, once for each of the two mechanisms.

4. Idiosyncracies of XML schema.  XML schema is used to define the
well-formedness of OLAC records, but it is unable to express
co-occurrence constraints between attribute values.  This means that
we cannot have more than one vocabulary for an element, forcing us to
build structure into element names and multiply the names
(e.g. Format.markup, Format.cpu, Format.os, ...).  It is unfortunate
that such a fundamental aspect of the OLAC XML format depends on a
shortcoming of a tool that we may not be using for very long.

In sum, the current model will be difficult to manage over the long
term.  Administratively, it encourages us to seek premature closure on
issues of content description that can never be closed.  Technically,
it forces us to release new versions of the metadata format with each
vocabulary revision, and forces us to create software infrastructure to
support a mishmash of four syntactic extensions of DC:

   <element.EXT1 refine="EXT2" code="EXT3" scheme="EXT4">


C. A NEW APPROACH

In response to the problems outlined above, we would like to propose a
new approach.  The basic idea is simple: express all refinements,
vocabularies and schemes using a uniform DC extension mechanism, and
treat them all as recommendations instead of centrally-validated
standards.  The extension mechanism requires two attributes, called
"extension" and "code", as shown below:

  <subject extension="OLAC-Language" code="x-sil-SKY"/>
  <subject extension="AS-Formosan" code="Seediq"/>

It would be syntactically valid to simply use an extension in metadata
without defining it. However, for extensions that will be used across
the community, there must also be a formal definition that enumerates
the corresponding controlled vocabulary in such a way that data
providers and service providers alike can harvest the vocabulary from
its definitive source. Thus another aspect of the new approach is an
XML schema for the formal definition of an XDC extension. In the
description section of the OAI Identify response, a data provider
would declare which formally defined extensions it employs in its metadata.

Extensions that enjoyed broad community support would be identified as
OLAC Recommendations (following the existing OLAC Process).  All OLAC
archives would be encouraged to adopt them, in the sense that OLAC
service providers would permit end-users to perform focussed searches
over these extensions.  In this way, archives that cooperate with the
rest of the community are rewarded.

Note that the approach isn't specific to language archives, so we're
calling it extensible Dublin Core (XDC).  An example of the syntax
is available (an XML DTD, the equivalent XML schema, and an instance
document): http://www.language-archives.org/XDC/0.1/


D. BENEFITS

The new approach is technically simpler than the existing approach,
and neatly solves the four problems we reported.

1. Finalizing standard refinements.  The editors of OLAC vocabulary
documents would be empowered to edit the vocabulary into the future,
without concern for integration with new releases of the OLAC metadata
format.

2. The artificial distinction between refinements and schemes.  The
syntactic distinction is gone, being replaced by a semantic one: is
the vocabulary an OLAC Recommendation or not?  Any archive or group of
archives would be free to start using their own extensions without any
formal registration.  They could build a service to demonstrate the
merit of their extension, thereby encouraging other archives to adopt
it.  Once broad support had been established, they could build a case
for an OLAC Recommendation, leading to adoption across the community.

3. Duplication of technical support.  With the single extension
mechanism, we can provide uniform technical support for validation,
translation and dumb-down.

4. Idiosyncracies of XML schema.  We no longer give XML schema such
sway in determining our XML syntax.  Other XML and database technologies
will be used to test that an extension is used correctly.

In sum, the new approach is extensible, requiring no central
administration of extensions, and no coordination of vocabulary
revisions with new releases of the metadata format.  The new approach
also supports interoperability across the whole OLAC community (via
OLAC Recommendations) and also among OLAC sub-communities that want to
create their own special-purpose extensions.


E. IMPLICATIONS

We are still working out the technical implications for OLAC central
services (e.g. registration, Vida, ORE, etc), and we will only be able
to implement parts of this in time for the December meeting.  As
always, we would welcome donations of programmer time to help us.

The short-term implication for OLAC archives is completely trivial,
since only a simple syntactic change is required.

The most important implication of this change is that it reduces the
pressure to reach final agreement on OLAC vocabularies by our December
workshop.  But this isn't an excuse for us to slow down on that front.
On the contrary, it frees us up to find working solutions for the key
vocabularies that define us as a community.  These will always be
imperfect compromises that we can agree to work with and revise as
necessary, well into the future.

In sum, we hope we are not opening up a technical can of worms, but
facilitating progress on the substantive issues, our common
descriptive ontologies.  Therefore, we encourage people to identify a
particular extension that they would like to work on, and post their
ideas and questions to this list (as Baden Hughes has just now done
for sourcecode).  You may also like to present your ideas at our
workshop in December...

--

So, what do you think?  Do you agree with our proposals for
(i) a syntactic simplification in our XML representation, and
(ii) switching OLAC vocabularies from being centrally validated
standards to recommendations?  We would welcome your feedback.

Steven Bird & Gary Simons


From sb at UNAGI.CIS.UPENN.EDU  Mon Sep 16 22:13:15 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 16 Sep 2002 18:13:15 EDT
Subject: query about format.sourcecode
In-Reply-To: Your mail dated Monday 16 September, 2002.
Message-ID: <MON.16.SEP.2002.181315.EDT.SB@LDC.UPENN.EDU>

Baden Hughes <baden at compuling.net> wrote:
> I've got a query about matters related to the element format.sourcecode

Its good to see discussion of software resources for a change, and I hope
the maintainers of software archives (DFKI, TRACTOR) will contribute to
this discussion.

> Currently the spec at http://www.language-archives.org/OLAC/olacms.html
> assumes that software resources indexed by OLAC will be in source code
> (and hence appropriate entries will be made under this tagset).

Not quite - all OLAC elements are optional, and some elements are simply
inappropriate for some resources.  Software distributed in binary form only
doesn't need to be given any sourcecode descriptor.

> The recommendation is currently:
>
> <format.sourcecode
> code="PROGRAMMING_LANGUAGE">Comments</format.sourcecode>
>
> There are several questions I have about this.
>
> 1) Do we need to clarify this even further as there are apparently two
> distinct options from the archive contents I've been working with). One
> is where the sourcecode requires compilation, the other is where
> sourcecode is essentially a script (or series of scripts). Any
> information about the "state" of the source code is likely to be
> inconsistent at best across archives, and I suspect even within a single
> archive. IMHO its relatively important to the end user of the OLAC
> search engine as to what state the sourcecode is in (ie how applicable
> is this code to the platforms I have access to).

Good, so the end-user requirement here is to be able to answer the
question: "Can I run this software?"

> 2) In the case where software resources indexed by OLAC are distributed
> in compiled form (ie not sourcecode) there's apparently not much more
> room to encode this information either. Apart from not strictly being
> something which belongs in a format.sourcecode element, the
> recommendation I assume would be that you could standardise this again
> by using the comment field, but the same consistency problem arises.
> Again, IMHO its relatively important to the end user of the OLAC search
> engine as to what state the sourcecode is in (ie can I just install and
> run or is it more complex)

Right, so the end-user requirement here is to be able to answer the
question: "How much effort will be required to get this running?"

> These two points may not represent large issues, but if the archives you
> are dealing with have a lot of software which ranges from source scripts
> in a range of languages, source for compilation for a range of
> compilers, and compiled "ready to run" applications, the granularity of
> this markup can be important and greatly assist with classification and
> indexation of resources in an appropriate manner. Additionally, for the
> less computer literate end users, this distinction is very important in
> them effectively locating a resource which is appropriate to their
> needs.

Absolutely.  Currently we have vocabularies for Sourcecode, CPU, and OS.
However, we can modify of scrap them if they don't serve our needs for
resource description and discovery.  Perhaps we need a new vocabulary
that better describes the state of the sourcecode.

One way to proceed here is for Baden (and any others) to identify the full
range of end-user requirements (is it more than these two?) then propose
vocabularies that best serve these requirements...

-Steven

--
Steven.Bird at ldc.upenn.edu  http://www.ldc.upenn.edu/sb
Assoc Director, LDC; Adj Assoc Prof, CIS & Linguistics
Linguistic Data Consortium, University of Pennsylvania
3600 Market St, Suite 810, Philadelphia, PA 19104-2653


From baden at COMPULING.NET  Fri Sep 20 11:57:22 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 20 Sep 2002 21:57:22 +1000
Subject: proposed revision of format.os
Message-ID: <FRI.20.SEP.2002.215722.1000.>

In working with several archives and drawing on other IT experience, I'd
like to make some proposed changes to the format.os schema.

---
<?xml version="1.0" encoding="utf-8" ?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.language-archives.org/OLAC/0.4/">
<annotation>
<documentation>1.0 OLAC Schema for operating system types, Steven Bird,
4/27/01 1.1 draft OLAC Schema for operating system types, Baden Hughes,
19/09/02</documentation>
</annotation>
<simpleType name="OLAC-OS-Code">
  <restriction base="string">
  <enumeration value="Unix" />
  <enumeration value="Unix/Linux" />
  <enumeration value="Unix/Solaris" />
  <enumeration value="Unix/SunOS" />
  <enumeration value="Unix/SCO" />
  <enumeration value="Unix/AIX" />
  <enumeration value="Unix/BSD" />
  <enumeration value="Unix/FreeBSD" />
  <enumeration value="Unix/OpenBSD" />
  <enumeration value="Unix/NetBSD" />
  <enumeration value="Unix/DECAlpha" />
  <enumeration value="Unix/GNU-Hurd" />
  <enumeration value="Unix/HPBLS" />
  <enumeration value="Unix/HPUX" />
  <enumeration value="Unix/IRIX" />
  <enumeration value="Unix/AIX" />
  <enumeration value="Unix/UnixWare" />
  <enumeration value="Unix/Xenix" />
  <enumeration value="Unix/VMS" />
  <enumeration value="SonyClieOS" />
  <enumeration value="Amiga" />
  <enumeration value="PalmOS" />
  <enumeration value="BeOS" />
  <enumeration value="NextSTEP" />
  <enumeration value="MacOS" />
  <enumeration value="MacOS/OSX" />
  <enumeration value="OS2" />
  <enumeration value="MSDOS" />
  <enumeration value="4DOS" />
  <enumeration value="MSWindows" />
  <enumeration value="MSWindows/win31" />
  <enumeration value="MSWindows/win95" />
  <enumeration value="MSWindows/winNT" />
  <enumeration value="MSWindows/win98" />
  <enumeration value="MSWindows/win2k" />
  <enumeration value="MSWindows/winCE" />
  <enumeration value="MSWindows/winME" />
  <enumeration value="MSWindows/winXP" />
  <enumeration value="MSWindows/PocketPC" />
  <enumeration value="MSWindows/PocketPC2002" />
  <enumeration value="MSWindows/.NET" />
  </restriction>
  </simpleType>
  </schema>
---

You can also find this draft schema at
http://www.compuling.net/projects/olac/190902-draft-olac-format.os.xsd

These changes essentially add to the list if possible operating systems
that I've encountered in classifying software.

If preferred, I can circulate to the list. If there's others interested
in working on this document, I'm more than happy to collaborate.

Baden


From baden at COMPULING.NET  Fri Sep 20 12:15:53 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 20 Sep 2002 22:15:53 +1000
Subject: proposed revision of format.cpu
Message-ID: <FRI.20.SEP.2002.221553.1000.>

In working with several archives and drawing on other IT experience, I'd
like to make some proposed changes to the format.cpu schema, (without
regurgitating the entire history of computing in the process :-).

---
<?xml version="1.0" encoding="utf-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.language-archives.org/OLAC/0.4/">

  <annotation>
    <documentation>
      1.0 OLAC Schema for CPUs, Steven Bird, 5/7/01
      1.1 draft OLAC Schema for CPU, Baden Hughes, 19/09/02
    </documentation>
  </annotation>

  <simpleType name="OLAC-CPU-Code">
    <restriction base="string">
      <enumeration value="x86"/>
      <enumeration value="MIPS"/>
      <enumeration value="Alpha"/>
      <enumeration value="Sparc"/>
      <enumeration value="680x0"/>
      <enumeration value="PA-RISC"/>
      <enumeration value="ARM"/>
      <enumeration value="ARM32"/>
      <enumeration value="Itanium"/>
      <enumeration value="IBM System 360/370/390"/>
      <enumeration value="Clipper"/>
      <enumeration value="i370"/>
      <enumeration value="i860"/>
      <enumeration value="i960"/>
      <enumeration value="Power4"/>
      <enumeration value="Cray"/>
      <enumeration value="m68k"/>
      <enumeration value="m88k"/>
      <enumeration value="ns32k"/>
      <enumeration value="IBM rs6000"/>
      <enumeration value="IBM rs10000"/>
      <enumeration value="s390"/>
      <enumeration value="VAX"/>
      <enumeration value="PowerPC"/>
      <enumeration value="DragonBall"/>
      <enumeration value="StrongARM"/>
      <enumeration value="Intel 8088"/>
      <enumeration value="Intel 8080"/>
      <enumeration value="Intel 8008"/>
      <enumeration value="Intel IA-64"/>
      <enumeration value="Motorola 6502"/>
      <enumeration value="Motorola 6800"/>
      <enumeration value="Motorola ColdFire"/>
      <enumeration value="Zilog Z80"/>
      <enumeration value="Transmeta Caruso"/>
      <enumeration value="SH3"/>
      <enumeration value="x86_64"/>


    </restriction>
  </simpleType>

</schema>
---

You can also find this draft schema at
http://www.compuling.net/projects/olac/190902-draft-olac-format.cpu.xsd

These changes essentially add to the list if possible operating systems
that I've encountered in classifying cpu architectures relevant to
language software. This includes some older mid-range style
architectures and the latest handheld architectures.

If preferred, I can circulate to the list. If there's others interested
in working on this document, again I'm more than happy to collaborate.

Baden


From baden at COMPULING.NET  Mon Sep 23 02:06:04 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 12:06:04 +1000
Subject: fproposed revision of format.sourcecode
Message-ID: <MON.23.SEP.2002.120604.1000.>

After a survey of several language archives, I'd like to propose some
possible changes to the format.sourceode schema. Essentially this list
is a list of programming languages of various types, in which software
may be written. This list includes those found at:
http://www.hypernews.org/HyperNews/get/computing/lang-list.html

A draft can be found online at:
http://www.compuling.net/projects/olac/220902-draft-olac-format.sourceco
de.xsd

Comments welcome.

Baden


From sb at UNAGI.CIS.UPENN.EDU  Mon Sep 23 06:38:54 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 23 Sep 2002 02:38:54 EDT
Subject: fproposed revision of format.sourcecode
In-Reply-To: Your mail dated Monday 23 September, 2002.
Message-ID: <MON.23.SEP.2002.023854.EDT.SB@LDC.UPENN.EDU>

Baden Hughes <baden at compuling.net> wrote:
> After a survey of several language archives, I'd like to propose some
> possible changes to the format.sourceode schema. Essentially this list
> is a list of programming languages of various types, in which software
> may be written. This list includes those found at:
> http://www.hypernews.org/HyperNews/get/computing/lang-list.html
>
> A draft can be found online at:
> http://www.compuling.net/projects/olac/220902-draft-olac-format.sourcecode.xsd
>
> Comments welcome.

This is great - a 20-fold increase on the number listed in my original 0.4
list.  I grepped for a few obscure languages and they were all there.

I'd like to raise two low-level technical issues, capitalization and
whitespace.

First, 99% of the codes are all-caps, even though some programming language
names are not written like this (e.g. the list gives "PROLOG" but it should
really be "Prolog").  However, rather than having to settle disputes about
this question, I'd prefer it if we case-normalized everything.  What do
people think - should we standardize on uppercase?

Second, Baden's list includes many items with spaces, e.g. "OBJECTIVE
CAML".  However, it seems desirable to limit the range of characters that
can appear in a controlled vocabulary item (e.g. no accents) so that there
is no transmission problems etc.  In some contexts, such as hand-crafted
CGI Get requests and HTML anchors, it is a pain to have to manually escape
the space character.  Could we live with a restriction of no spaces -
i.e. replacing spaces with underscore?

** Note that neither of these issues is substantive, since each controlled
vocabulary item will be associated with a human readable form (including
translations into other languages).  For example, in Dublin Core, there is
a refinement named "hasVersion" with the human-readable label "Has
Version".  [http://www.dublincore.org/documents/dcmes-qualifiers/].
The plan is to do the same thing for OLAC vocabularies.

-Steven


From baden at COMPULING.NET  Mon Sep 23 07:00:36 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 17:00:36 +1000
Subject: fproposed revision of format.sourcecode
In-Reply-To: <200209230639.g8N6csL10762@unagi.cis.upenn.edu>
Message-ID: <MON.23.SEP.2002.170036.1000.>

I've updated the format.sourcecode schema draft with:

-unnecessary whitespace removed
-whitespace normalized to underscores in enumeration values
-typos corrected

You can find the updated list here:

http://www.compuling.net/projects/olac/230902-draft-olac-format.sourceco
de.xsd

There's currently 285 programming languages listed on this schema. If
any one has any more to add, drop me an email.

Regards

Baden

> -----Original Message-----
> From: Steven Bird [mailto:sb at unagi.cis.upenn.edu]
> Sent: Monday, 23 September 2002 16:39
> To: baden at compuling.net
> Cc: OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG
> Subject: Re: fproposed revision of format.sourcecode
>
>
>
> Baden Hughes <baden at compuling.net> wrote:
> > After a survey of several language archives, I'd like to
> propose some
> > possible changes to the format.sourceode schema.
> Essentially this list
> > is a list of programming languages of various types, in
> which software
> > may be written. This list includes those found at:
> > http://www.hypernews.org/HyperNews/get/computing/lang-list.html
> >
> > A draft can be found online at:
> >
> http://www.compuling.net/projects/olac/220902->
draft-olac-format.source
> > code.xsd
> >
> > Comments welcome.
>
> This is great - a 20-fold increase on the number listed in my
> original 0.4 list.  I grepped for a few obscure languages and
> they were all there.
>
> I'd like to raise two low-level technical issues,
> capitalization and whitespace.
>
> First, 99% of the codes are all-caps, even though some
> programming language names are not written like this (e.g.
> the list gives "PROLOG" but it should really be "Prolog").
> However, rather than having to settle disputes about this
> question, I'd prefer it if we case-normalized everything.
> What do people think - should we standardize on uppercase?
>
> Second, Baden's list includes many items with spaces, e.g.
> "OBJECTIVE CAML".  However, it seems desirable to limit the
> range of characters that can appear in a controlled
> vocabulary item (e.g. no accents) so that there is no
> transmission problems etc.  In some contexts, such as
> hand-crafted CGI Get requests and HTML anchors, it is a pain
> to have to manually escape the space character.  Could we
> live with a restriction of no spaces - i.e. replacing spaces
> with underscore?
>
> ** Note that neither of these issues is substantive, since
> each controlled vocabulary item will be associated with a
> human readable form (including translations into other
> languages).  For example, in Dublin Core, there is a
> refinement named "hasVersion" with the human-readable label
> "Has Version".
> [http://www.dublincore.org/documents/dcmes-> qualifiers/].
> The
> plan is to do the same thing for OLAC vocabularies.
>
> -Steven
>


From ruyng at GATE.SINICA.EDU.TW  Mon Sep 23 10:29:42 2002
From: ruyng at GATE.SINICA.EDU.TW (Ru-Yng Chang)
Date: Mon, 23 Sep 2002 06:29:42 -0400
Subject: fproposed revision of format.sourcecode
Message-ID: <MON.23.SEP.2002.062942.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Dear all,

I find the difference between the draft and
the code for program language of the standard of Chinese  catalogue from
National Central Library.
http://datas.ncl.edu.tw/catweb/2-1-2a.htm(Big-5 encoding.)

As the list.


---A-----------
ADAPTIVE SERVER ENTERPRISE
ADS-C
AL
ALPHARD
ANALITIK
ANNA
APL2


---B-----------
BCY/B


---C-----------
CADL
CALM
CANDE
CCL
CIP-L
CLIPPER
COLTS
COMSKEE
CONCURRENT_EUCLID


---D-----------
D.L.LOGO
DATAPLOT
DBL
DIST
DYNAMO


---E-----------
EDISON
ELAN


---F-----------
FOCUS
FRED


---G-----------
GHC
GLYPNIR


---H-----------
HYPERTALK


---I-----------
IDL
INFORMIX-4GL
INTERPRESS
ISETL
ISP


---J-----------
JAVA
JAVA_APPLET (INCLUED IN JAVA)
JAVA_WORKSHOP (INCLUED IN JAVA)
JOSEF


---K-----------
KHUWARIZMI
KYLIX


---L-----------
LISP
LOGLAN_82
LOGO
LOTUS_SCRIPT
LUCID


---M-----------
MACRO-11
MFC
MODULA-2
MOUSE


---M-----------
NATAL
NPL

---O-----------
OCCAM2
OPS5

---P-----------
PARAGON
PARLOG
PILOT
PLEASE
PL/1
PL/M51
PL/SQL
POP11
PORTAL
PSEUDOCODE
PUCMAT

---Q-----------
QEDIT


---R-----------
ROSS


---S-----------
S-ALGOL
SGML
SHELL
SIMNET
SMAL/80
SNAP
SNOBOL
SPECOL
SPITBOL
SQL/ORACLE
STAROFFICE
STEP_3
STEP_5
SURVIS


---T-----------
T
TIME_SERIES_PROCESSOR
TURBO
TUTOR


---U-----------
UCSD_PASCAL
UNIGRAPHICS
UNISON_AUTHOR_LANGUAGE


Ru-Yng


From baden at COMPULING.NET  Mon Sep 23 13:28:23 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 23:28:23 +1000
Subject: proposed revision of format.sourcecode
In-Reply-To: <OLAC-IMPLEMENTERS%2002092306394299@LISTSERV.LINGUISTLIST.ORG>
Message-ID: <MON.23.SEP.2002.232823.1000.>

An updated version of the format.sourcecode schema is now available
online with additions from Ru-Yng Chang.

http://www.compuling.net/projects/olac/240902-draft-olac-format.sourceco
de.xsd

Regards

Baden


From gary.holton at UAF.EDU  Tue Sep 24 14:07:00 2002
From: gary.holton at UAF.EDU (Gary Holton)
Date: Tue, 24 Sep 2002 10:07:00 -0400
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <TUE.24.SEP.2002.100700.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
wrote:
>--
>
>So, what do you think?  Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations?  We would welcome your feedback.
>


Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is
very positive. Regarding (i), it is clear that a syntactic simplification
is needed. I for one have never been able to keep straight refinements vs.
schemes, and I don't think I'm alone here. And as you point out (ii), the
real issue should be not whether a particular refinement (and associated
vocabulary) has been officially adopted (mandated?), but rather whether a
such a refinement is useful to the community. We can debate ontologies, but
it is more difficult to debate usefulness without actually implementing a
refinement. Your proposal would permit refinements ("extensions") to fit
the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types
of refinements and vocabularies would be useful to the end user. The
difficulty is that we don't know enough about how the user will be
searching, what they will be searching for, and what types of search
facilities they will have. The best we can do at this point is make an
educated guess and then watch closely to see how the refinements and
vocabularies are actually used. That said, I think we have some very good
guesses already and will certainly be able to recommend best practices by
December. However, if we lock in the vocabularies then most archives will
continue to have to support both an OLAC schema and a user-defined schema
(as you point out). This would essentially remove the data provider from
the loop, in that user-defined schemas would be viewed as idiosyncratic and
non-standard. Allowing user-defined "extensions" would encourage innovation
on the part of both data and service providers--innovation mediated by the
end user.

Any reactions from others?

Gary Holton


From sb at UNAGI.CIS.UPENN.EDU  Tue Sep 24 22:25:11 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Tue, 24 Sep 2002 18:25:11 EDT
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <TUE.24.SEP.2002.182511.EDT.SB@LDC.UPENN.EDU>

Thanks for the positive feedback.  While we await more reactions let me
jump in and say that Gary and I are working on a revised version of
the proposal to bring it into line with new developments in the Dublin Core
Metadata Initiative (DCMI).  We'll preserve the new extensibility that
people seem to appreciate, but also make syntactic changes to maximize
interoperability with the wider digital libraries community.

In the past we've basically gone it alone in working out how to represent
our own DC qualifications in XML.  However, the timing of these
recommendations and our forthcoming workshop present us with a new
opportunity to standardize our implementation.

If you'd like to learn more about what's happening in DCMI with qualifiers
and XML, please see the following article and the material it cites:

  Recommendations for XML Schema for Qualified Dublin Core
  Proposal to DC Architecture Working Group
  http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/

Next week we'll circulate a proposal for how OLAC can conform with this.

Note that this is only about XML implementation and not OLAC content.
For those who only care about disseminating metadata, conformance with the
DCMI recommendations will ensure maximal interoperability with the wider
digital libraries community, so that your metadata pops up all over
cyberspace.


Back on the subject of extensibility...  The key innovation in our recent
proposal, that we'd still like more feedback on, is for the OLAC
vocabularies to be changed from being centrally enforced standards to
recommended practices.  Under this model, any archive will be able to adopt
and promulgate its favorite ontologies, while the OLAC Process is still
used to identify community-agreed best practices that everyone should
follow.

For instance, consider the sourcecode vocabulary, which is only relevant to
the software archives and which may need constant updates.  Under the
proposed model, the vocabulary wouldn't actually need to reside on the OLAC
site; it could live wherever it could be easily maintained.  However, the
OLAC site would host the details of any associated working group, so that
others could discover the group and contribute to the revision of the
vocabulary.  It would also host any associated OLAC recommendation, so that
everyone would know that the OLAC community had adopted a certain
vocabulary as best practice.

-Steven


From jcgood at SOCRATES.BERKELEY.EDU  Tue Sep 24 23:23:50 2002
From: jcgood at SOCRATES.BERKELEY.EDU (Jeff Good)
Date: Tue, 24 Sep 2002 16:23:50 -0700
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID: <TUE.24.SEP.2002.162350.0700.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Hello,

I wanted to say that I think the basic designs of the revisions proposed
by Steven and Gary are very good suggestions. I completely agree with
Gary Holton's points--so I won't repeat them.

I thought I'd point out how I think these revisions can be usefully
applied to some problems that the working group evaluating the
linguistic types document. I think this new format will allow us to get
past many issues which I thought may have been intractable. I guess I
consider this to be a good "empirical" test of the proposal.

The specific problem was that there are many cross-cutting ways to
classify the "type" of a linguistic document. There's a sense in which a
document focuses on a big sub-field of linguistics like phonology,
morphology, etc. There's the basic structure of a document: dictionary,
grammar, text (the term "macrostructure" can be used to describe this
category). And then there are important "meso/micro-structure" aspects
of documents---like the type of transcription used (free translation,
interlinear, etc.)

The original OLAC system encouraged us to create an ontology of document
types which assumed that there was one "type" for a document, when, in
reality, type is a multi-dimensional concept. As we realized this, we
started to break down the types into the most important dimensions--like
linguistic subject, basic structure, etc. But even then, there were
problems of classification. For example, categories like "oratory",
"narrative", "ludic" seemed appropriate for some linguistic
documents--but it isn't immediately clear where they belong in a
hierarchy of types (are they structural or content types? or are they
something else?).

It was possible to create a system of types which works, but I think
many of our conceptual and implementational problems can be more cleanly
solved by the new systems because of it extensibility.

Specifically, rather than having to pigeonhole types into a few
categories in a hierarchy, we can just propose a series of vocabularies
corresponding to the potentially independent "type" parameters of a
document--for example, a linguistic subject vocabulary, a document
structural type vocabulary, a "discourse"-type vocabulary for things
like "oratory" and "narrative". (For more detail on this, there are
relevant recent posts, one from me, on the Metadata list.)

Over time, I'm sure we'll find some of the vocabularies are more
useful/used than others--and these can become OLAC recommended standard
vocabularies. I think the real value of the new system will be that it
is much more forgiving/flexible if we find we need to adapt our "type"
categories in the future.

Since Steven just posted about the idea that vocabularies be recommended
practices, I'll say that I think that aspect of the proposal is also
very helpful to working out a linguistic type vocabulary. One thing that
at least I am convinced of in the discussion of "types" is that there is
a counterexample to every generalization you can make about them. It may
be the case that some counterexamples are minor enough that we can get
away without a good classification for them. Or it might be the case
that a counterexample is revealing a set of important omissions in the
proposals. It's hard to tell without testing a lot of archives.

A recommended, but not enforced, vocabulary would address this
problem--as archivers encounter situations that aren't covered, they
wouldn't be forced to "fit" their document into a category where it
doesn't belong. This would not only promote the creation of needed new
vocabulary items but also maintain the integrity of existing ones.

Additionally, the idea of recommended vocabularies, plus a best practice
standard, certainly is more in line with the general spirit of OLAC, and
I think it would encourage more subcommunities to get involved and
create vocabularies which they need.

Jeff


From baden at COMPULING.NET  Wed Sep 25 04:31:35 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Wed, 25 Sep 2002 14:31:35 +1000
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID: <WED.25.SEP.2002.143135.1000.>

> So, what do you think?  Do you agree with our proposals for
> (i) a syntactic simplification in our XML representation, and

The syntactic revision I personally agree with. Backwards and future
compatibility is a significant factor and as such the new revisions I
believe will make it easier to implement changes community wide and
benefit archives who require special purpose extensions.

> (ii) switching OLAC vocabularies from being centrally
> validated standards to recommendations?  We would welcome
> your feedback.

The proposal for recommendations rather than mandated standards seems to
draw partially on both the W3C and IETF processes, whereby drafts or
notes are submitted, reviewed, implemented and then reviewed with the
view to standardisation if agreed as best practice. This process scales
very well, and yet allows individuals or institutions the freedom to
innovate whilst encouraging best practice once peer review of
implementations has taken place. I think this is important to encourage
innovation amongst participating archives who develop vocabularies to
address their own needs first and then promote the benefits of these for
wider community consideration.

Baden


From hdry at LINGUISTLIST.ORG  Thu Sep 26 23:15:53 2002
From: hdry at LINGUISTLIST.ORG (Helen Dry)
Date: Thu, 26 Sep 2002 19:15:53 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID: <THU.26.SEP.2002.191553.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Hi, Steven (and everyone)

Sorry to be so late responding to this proposal, but it's been a busy month.

I am a little concerned about this proposal, perhaps because I don't understand
exactly how the scheme system would work, so I thought I should make my
comments and ask a few questions.  Apologies if either or both are at a rather
elementary level--I only seem to understand DC and XML for 10 minutes, right after
I reread the websites.  :-)

It seems to me that there are two separable proposals here:  (1) collapsing the
formal mechanisms of refinement and scheme into the extension mechanism and
(2) abandoning the attempt to reach general consensus on the descriptors that
previously we were calling controlled vocabularies. The first may well be a welcome
simplification, particularly administratively. (And I seem to have heard that it's the
way the DC is going anyway.)  The second seems worrisome to me for two primary
reasons:  (1) it seems counter to the overarching OLAC (and EMELD) goal of a
unified--dare we say "standardized"?--mechanism for resource description and
retrieval within the discipline; (2) on a practical level it may complicate--perhaps to
a debilitating degree--the way that  service providers implement search facilities.

Of course, I'm thinking about LINGUIST here--we aren't an archive, so the potential
benefits of being able to DESCRIBE resources via any scheme we might devise are
not salient to me.  What I'm worried about is how we're going to offer a search
engine that makes use of all these variant descriptions.  Particularly for something
like linguistic data types--which is probably the main search field linguists will want
to use--this seems almost like a return to the bad old days of the free text field, with
the consequent loss of ability to identify and retrieve relevant resources.

Now I imagine that there is some formal mechanism for relating schemes--I know
you have a paragraph below about archives putting the schemes they use in their
identifiers.   But could you tell me exactly how this would work in practice?  E.g., at
the level of elements or terms?  Would an archive that wants to use its own scheme
have to provide a document showing how its categories relate to the categories in
all the other schemes (e.g., that its "Seediq" was SIL's "Taroko.")   Would the
service provider have to construct a search engine that would first find and correlate
all these documents, then search the multi-archive metadata for the resulting sets of
terms?  I'm sure it's possible--IF you could get everyone to provide scheme
mappings--but it certainly seems unnecessarily complex. . . and, as I said, counter
to the purpose of OLAC.  I thought we were trying to settle on a unified way to
describe linguistic resources, in order to offer the discipline the benefits of a level of
standardization.  Though this will come at the admitted expense of a certain amount
of detail and precision, I feel confident that it will be accepted (accepted for what it
is) if we persevere. After all, DC isn't perfect but people understand the utility of a
restricted set of elements.

It seems to me that, if the problem is that we may not come up with a proposal
before December, we should either redouble our efforts and make the deadline or
extend the deadline--not scrap the enterprise.  Actually, with regard to linguistic data
types, I feel confident we can come up with a reasonable proposal before the
deadline.  And I think it's important that we do so, since this is really one of the most
important vocabularies--probably the most important for a large part of our
audience, i.e. academic linguists.  It's the main way that people, as opposed to
machines, will want to search the archives.

So, in sum, I agree with the arguments for using the extension mechanism and
abandoning refinement and scheme.  But I don't see the need to abandon the goal
of reaching consensus on a single "OLAC-approved" set of linguistic data types,
however that would be modeled in a world of "extensions" (not controlled
vocabularies).  Can we use extensions but not let in the world?

BTW, under the proposal, will all the current refinements--e.g., "subject.language"
now become schemes?

But now I should stop and let someone knowledgable explain to me exactly how this
scheme system will work.  I'm all ears . . . .   :-)

Ready for enlightenment ....

-Helen


On 16 Sep 2002 at 17:39, Steven Bird wrote:

The OLAC metadata format provides two mechanisms for community-
specific resource description.  First, special refinements (metadata
elements and corresponding vocabularies) support compatible
description across the community.  For example, the subject.language
element, and the OLAC-Language vocabulary, permit all archives to
identify subject language in the same manner.  Second, every OLAC
element permits an optional scheme attribute for use by
sub-communities of OLAC.  For example, the scholars at Academia Sinica
can use their own naming scheme for Formosan languages and still
package it up using the OLAC metadata container.  This combination of
standard refinements and user-defined schemes seems to offer a
reasonable balance between interoperability and extensibility.

Over the past month, Gary and I have been reviewing the design of OLAC
metadata and have concluded that these parallel mechanisms are
unnecessary.  We think that with a *single* extension mechanism, OLAC
can provide even better interoperability and extensibility.  Moreover,
we think this can be done with less administrative and technical
infrastructure than before, making it still easier for archives to
participate in OLAC.


A. THE PRESENT SITUATION

We begin with a quick review of how the two existing mechanisms work
in OLAC metadata.  First, community-specific refinements are
represented using Dublin Core qualifications represented in XML.  Here
is an example for subject language:

  A resource about the Sikaiana language:
  <subject.language code="x-sil-SKY"/>

This refinement permits focussed searching and better precision/recall
than the corresponding Dublin Core element:

  <subject>The Sikaiana Language</subject>

The OLAC version is flexible in that the code attribute is optional
and that free-text can be put in the element content.

The second mechanism is for user-defined schemes.  All OLAC elements
permit a scheme attribute, naming some third-party format or
vocabulary that one or more OLAC archives use.  For instance, the
language listed by Ethnologue as Taroko (TRV) is known as Seediq in
Academia Sinica, and OLAC would permit either or both of the following
elements to appear in a metadata record for this language:

  <subject.language code="x-sil-TRV"/>
  <subject.language scheme="AS-Formosan">Seediq</subject.language>

Such a resource would be discovered under either naming scheme, and
Academia Sinica could provide end-user services that rewarded any archive
which employed its scheme for Formosan language identification.


B. PROBLEMS WITH THE PRESENT SITUATION

There are four general problems with the present situation.

1. Finalizing standard refinements.  Our track record at developing
controlled vocabularies over the past year indicates that we are not
going to be able to finalize all the vocabularies that the OLAC
metadata standard specifies in time for launching version 1.0 after
our December workshop.  Even if some vocabularies are finalized by
December, the discussion may be reopened any time a new kind of
archive joins OLAC.  However, each vocabulary revision must currently
be released as a new version of the entire OLAC metadata set, an
unacceptable bureaucratic obstacle.

2. The artificial distinction between refinements and schemes.  It is
not clear when a putative refinement is important enough to be adopted
as an OLAC standard, versus a user-defined scheme.  Some of the
refinements we recognize at present aren't as germane to the overall
enterprise as others (e.g. operating system vs subject language), and
may not have enough support to be retained.  Conversely, the community
is sure to develop new, useful ontologies that we don't support at
present, and we would need to change the OLAC metadata standard in
order to accommodate them.  Promoting a user-defined scheme to an OLAC
standard would necessitate a change in the XML representation, generating
unnecessary work for all archives that support the scheme.

3. Duplication of technical support.  User-defined schemes are likely
to involve controlled vocabularies, with the same needs as OLAC
vocabularies with respect to validation, translation to
human-readable form in service providers, and dumb-down to Dublin Core
for OAI interoperability.  At present, the necessary infrastructure
must be created twice over, once for each of the two mechanisms.

4. Idiosyncracies of XML schema.  XML schema is used to define the
well-formedness of OLAC records, but it is unable to express
co-occurrence constraints between attribute values.  This means that
we cannot have more than one vocabulary for an element, forcing us to
build structure into element names and multiply the names
(e.g. Format.markup, Format.cpu, Format.os, ...).  It is unfortunate
that such a fundamental aspect of the OLAC XML format depends on a
shortcoming of a tool that we may not be using for very long.

In sum, the current model will be difficult to manage over the long
term.  Administratively, it encourages us to seek premature closure on
issues of content description that can never be closed.  Technically,
it forces us to release new versions of the metadata format with each
vocabulary revision, and forces us to create software infrastructure to
support a mishmash of four syntactic extensions of DC:

   <element.EXT1 refine="EXT2" code="EXT3" scheme="EXT4">


C. A NEW APPROACH

In response to the problems outlined above, we would like to propose a
new approach.  The basic idea is simple: express all refinements,
vocabularies and schemes using a uniform DC extension mechanism, and
treat them all as recommendations instead of centrally-validated
standards.  The extension mechanism requires two attributes, called
"extension" and "code", as shown below:

  <subject extension="OLAC-Language" code="x-sil-SKY"/>
  <subject extension="AS-Formosan" code="Seediq"/>

It would be syntactically valid to simply use an extension in metadata
without defining it. However, for extensions that will be used across
the community, there must also be a formal definition that enumerates
the corresponding controlled vocabulary in such a way that data
providers and service providers alike can harvest the vocabulary from
its definitive source. Thus another aspect of the new approach is an
XML schema for the formal definition of an XDC extension. In the
description section of the OAI Identify response, a data provider
would declare which formally defined extensions it employs in its metadata.

Extensions that enjoyed broad community support would be identified as
OLAC Recommendations (following the existing OLAC Process).  All OLAC
archives would be encouraged to adopt them, in the sense that OLAC
service providers would permit end-users to perform focussed searches
over these extensions.  In this way, archives that cooperate with the
rest of the community are rewarded.

Note that the approach isn't specific to language archives, so we're
calling it extensible Dublin Core (XDC).  An example of the syntax
is available (an XML DTD, the equivalent XML schema, and an instance
document): http://www.language-archives.org/XDC/0.1/


D. BENEFITS

The new approach is technically simpler than the existing approach,
and neatly solves the four problems we reported.

1. Finalizing standard refinements.  The editors of OLAC vocabulary
documents would be empowered to edit the vocabulary into the future,
without concern for integration with new releases of the OLAC metadata
format.

2. The artificial distinction between refinements and schemes.  The
syntactic distinction is gone, being replaced by a semantic one: is
the vocabulary an OLAC Recommendation or not?  Any archive or group of
archives would be free to start using their own extensions without any
formal registration.  They could build a service to demonstrate the
merit of their extension, thereby encouraging other archives to adopt
it.  Once broad support had been established, they could build a case
for an OLAC Recommendation, leading to adoption across the community.

3. Duplication of technical support.  With the single extension
mechanism, we can provide uniform technical support for validation,
translation and dumb-down.

4. Idiosyncracies of XML schema.  We no longer give XML schema such
sway in determining our XML syntax.  Other XML and database technologies
will be used to test that an extension is used correctly.

In sum, the new approach is extensible, requiring no central
administration of extensions, and no coordination of vocabulary
revisions with new releases of the metadata format.  The new approach
also supports interoperability across the whole OLAC community (via
OLAC Recommendations) and also among OLAC sub-communities that want to
create their own special-purpose extensions.


E. IMPLICATIONS

We are still working out the technical implications for OLAC central
services (e.g. registration, Vida, ORE, etc), and we will only be able
to implement parts of this in time for the December meeting.  As
always, we would welcome donations of programmer time to help us.

The short-term implication for OLAC archives is completely trivial,
since only a simple syntactic change is required.

The most important implication of this change is that it reduces the
pressure to reach final agreement on OLAC vocabularies by our December
workshop.  But this isn't an excuse for us to slow down on that front.
On the contrary, it frees us up to find working solutions for the key
vocabularies that define us as a community.  These will always be
imperfect compromises that we can agree to work with and revise as
necessary, well into the future.

In sum, we hope we are not opening up a technical can of worms, but
facilitating progress on the substantive issues, our common
descriptive ontologies.  Therefore, we encourage people to identify a
particular extension that they would like to work on, and post their
ideas and questions to this list (as Baden Hughes has just now done
for sourcecode).  You may also like to present your ideas at our
workshop in December...

--

So, what do you think?  Do you agree with our proposals for
(i) a syntactic simplification in our XML representation, and
(ii) switching OLAC vocabularies from being centrally validated
standards to recommendations?  We would welcome your feedback.

Steven Bird & Gary Simons


From hdry at LINGUISTLIST.ORG  Thu Sep 26 23:36:44 2002
From: hdry at LINGUISTLIST.ORG (Helen Dry)
Date: Thu, 26 Sep 2002 19:36:44 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <OLAC-IMPLEMENTERS%2002092410070062@LISTSERV.LINGUISTLIST.ORG>
Message-ID: <THU.26.SEP.2002.193644.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Hi, Gary (and everyone),

I've just sent a long posting to the list explaining some of my problems with Steven's
& Gary's proposal, so all I want to do here is respond briefly. I completely agree with
your point about the value of syntactic simplification.  But I'm not sure about the
second point--reducing all OLAC vocabularies to recommendations.  It's interesting
where our opinions diverge--i.e., you see the benefits to the archive, which may
already have a user-defined scheme, and I see the possible problems for the
general service provider, which may not be able to handle multiple user-defined
schemes in an efficient way.  Perhaps OLAC can handle this problem by making
STRONG recommendations . . . but in that case, I don't see the real difference
between recommendations and a centrally validated standard . . . except for the fact
that OLAC wouldn't have to re-publish all the metadata whenever a
recommendation changed.  I suppose this would be an administrative advantage--
but enough of a one to lose the potential benefits of standardization???  I'm waiting
to be convinced....

-Helen


On 24 Sep 2002 at 10:07, Gary Holton wrote:

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
wrote:
>--
>
>So, what do you think?  Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations?  We would welcome your feedback.
>


Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is
very positive. Regarding (i), it is clear that a syntactic simplification
is needed. I for one have never been able to keep straight refinements vs.
schemes, and I don't think I'm alone here. And as you point out (ii), the
real issue should be not whether a particular refinement (and associated
vocabulary) has been officially adopted (mandated?), but rather whether a
such a refinement is useful to the community. We can debate ontologies, but
it is more difficult to debate usefulness without actually implementing a
refinement. Your proposal would permit refinements ("extensions") to fit
the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types
of refinements and vocabularies would be useful to the end user. The
difficulty is that we don't know enough about how the user will be
searching, what they will be searching for, and what types of search
facilities they will have. The best we can do at this point is make an
educated guess and then watch closely to see how the refinements and
vocabularies are actually used. That said, I think we have some very good
guesses already and will certainly be able to recommend best practices by
December. However, if we lock in the vocabularies then most archives will
continue to have to support both an OLAC schema and a user-defined schema
(as you point out). This would essentially remove the data provider from
the loop, in that user-defined schemas would be viewed as idiosyncratic and
non-standard. Allowing user-defined "extensions" would encourage innovation
on the part of both data and service providers--innovation mediated by the
end user.

Any reactions from others?

Gary Holton


From Gary_Simons at SIL.ORG  Fri Sep 27 00:24:30 2002
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Thu, 26 Sep 2002 19:24:30 -0500
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <THU.26.SEP.2002.192430.0500.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Helen,

You hit the nail on the head when you observe: "in that case, I don't see
the real difference between recommendations and a centrally validated
standard".  It was that same observation, but coming from the point of view
of our status quo, that has been a key part of the motivation as Steven and
I have been thinking about what our version 1.0 standard should look like.

In version 0.4 we have a centrally validated and mandated standard, but it
has built-in optionality.  For instance, it is our standard to use SIL and
Linguist codes to identify languages precisely, but data providers also
have the option of just providing free text.  Thus the standard is
currently not requiring language codes but only recommending them as best
practice, and an examination of the harvested records from our 20 or so
participating data providers reveals the fact that many sites are not now
using codes.

Our proposal to take the controlled vocabularies out of the standard and to
treat them as best practice recommendations thus does not really change the
current reality.  In fact, it probably gives a better reflection of the
reality. One key advantage from the point of view of managing the
infrastructure is that it will not be necessary to change the standard when
controlled vocabularies are changed or added.  The metadata standard would
just specify the structure of the container record and the mechanism for
defining metadata extensions and would be very static.  Each controlled
vocabulary would be managed separately in an independent document and in a
formal extension definition that would supply downloadable code sets so
that extension data can still be centrally validated.  When the community
reaches a consensus that a particular vocabulary should be used when
applicable, then it would become a community Recommendation and our default
harvester would support it. Service providers would exploit it (such as
Linguist is now doing with searching by language) and that would show data
providers who are not yet using the vocabulary the benefits of using it.
We could even have a "Recommended practice report card" that would show
which recommended extensions an archive is using and which it is not.

Thus Steven and I are assuming that the end result of this change would not
weaken compliance to standardized vocabularies (which is already optional),
but that it would make it much easier to manage changes to vocabularies and
to experiment with specialized vocabularies.

I hope that helps to clarify where we are coming from.

-Gary Simons


                      Helen Dry <hdry at LINGUISTLIST.ORG>
                      Sent by: OLAC Implementers List            To:
                      <OLAC-IMPLEMENTERS at LISTSERV.LINGUI         OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST
                      STLIST.ORG>                                .ORG
                                                                 cc:
                                                                 Subject: Re: A simpler format for OLAC
                      09/26/02 06:36 PM                          vocabularies and schemes
                      Please respond to Open Language
                      Archives Community Implementers
                      List


Hi, Gary (and everyone),

I've just sent a long posting to the list explaining some of my problems
with Steven's
& Gary's proposal, so all I want to do here is respond briefly. I
completely agree with
your point about the value of syntactic simplification.  But I'm not sure
about the
second point--reducing all OLAC vocabularies to recommendations.  It's
interesting
where our opinions diverge--i.e., you see the benefits to the archive,
which may
already have a user-defined scheme, and I see the possible problems for the
general service provider, which may not be able to handle multiple
user-defined
schemes in an efficient way.  Perhaps OLAC can handle this problem by
making
STRONG recommendations . . . but in that case, I don't see the real
difference
between recommendations and a centrally validated standard . . . except for
the fact
that OLAC wouldn't have to re-publish all the metadata whenever a
recommendation changed.  I suppose this would be an administrative
advantage--
but enough of a one to lose the potential benefits of standardization???
I'm waiting
to be convinced....

-Helen


On 24 Sep 2002 at 10:07, Gary Holton wrote:

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
wrote:
>--
>
>So, what do you think?  Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations?  We would welcome your feedback.
>


Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is
very positive. Regarding (i), it is clear that a syntactic simplification
is needed. I for one have never been able to keep straight refinements vs.
schemes, and I don't think I'm alone here. And as you point out (ii), the
real issue should be not whether a particular refinement (and associated
vocabulary) has been officially adopted (mandated?), but rather whether a
such a refinement is useful to the community. We can debate ontologies, but
it is more difficult to debate usefulness without actually implementing a
refinement. Your proposal would permit refinements ("extensions") to fit
the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types
of refinements and vocabularies would be useful to the end user. The
difficulty is that we don't know enough about how the user will be
searching, what they will be searching for, and what types of search
facilities they will have. The best we can do at this point is make an
educated guess and then watch closely to see how the refinements and
vocabularies are actually used. That said, I think we have some very good
guesses already and will certainly be able to recommend best practices by
December. However, if we lock in the vocabularies then most archives will
continue to have to support both an OLAC schema and a user-defined schema
(as you point out). This would essentially remove the data provider from
the loop, in that user-defined schemas would be viewed as idiosyncratic and
non-standard. Allowing user-defined "extensions" would encourage innovation
on the part of both data and service providers--innovation mediated by the
end user.

Any reactions from others?

Gary Holton


From hdry at LINGUISTLIST.ORG  Fri Sep 27 16:46:45 2002
From: hdry at LINGUISTLIST.ORG (Helen Aristar Dry)
Date: Fri, 27 Sep 2002 12:46:45 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <OF9D7B4ADE.D5EB9713-ON86256C40.0083CA5B@sil.org>
Message-ID: <FRI.27.SEP.2002.124645.0400.>

Hi, Gary,

Yes, I take your point that we can't force compliance; and, in
general, I'd be all for letting standards evolve from usage.  But
actually, from the point of view of the LINGUIST service provider,
the languages example isn't a heartening one.  What our
programmer had to do to  search harvested OLAC metadata by
subject language is write a special program that translates any text
entry in the subject language field into the SIL code.   This is
possible to do with languages  only because we have the
Ethnologue name and alternate name tables on the site, and
therefore we have a list of almost all the language names that any
site might be using.  It's still a lot of work, and we're no doubt
missing or misclassifying the subject languages of a lot of records.
Nevertheless, we do have a search engine that is using Ethnologue
codes to identify resources by subject.language, thereby
demonstrating the utility of this recommendation.

But what are we going to do for linguistic data type and all the
other erstwhile controlled vocabularies?? There's no "alternate
name" reference for extensions (at least not as far as I know), such
that we could use it to write a translation program . .  even if it were
feasible to translate every relevant value in every metadata record.
And it makes no sense to set up search facilities that use the
recommended vocabulary if  there's no data classified by it--getting
a lot of "not found" messages will discourage users from using the
recommended vocabulary, not encourage it.  So our search engine
is not going to be any help in promulgating these recommendations.

Sigh.  I realize that mandating a controlled vocabulary wouldn't
ensure that archives used it.  Perhaps it would give them a little
more impetus, however.  And it would certainly be nice if each
archive would "translate" its user-defined metadata into the
recommended OLAC vocabulary, rather than leaving the service
provider to figure out how to do it  for multiple archives, each with
its own idiosyncratic and undocumented set of extensions.
I'm still hoping that you and Steven will come up with some bright
ideas about how to help/encourage/convince archives to do this . . .

Sorry to be negative.  You know I think OLAC is the best thing
since sliced bread. . . . I'm just having some trouble figuring out
how we're going to cope with the new-fangled slices....

All the best,
-Helen


Date sent:      	Thu, 26 Sep 2002 19:24:30 -0500
Send reply to:  	Open Language Archives Community Implementers List
             	<OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG>
From:           	Gary Simons <Gary_Simons at SIL.ORG>
Subject:        	Re: A simpler format for OLAC vocabularies and schemes
To:             	OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG

> Helen,
>
> You hit the nail on the head when you observe: "in that case, I don't see
> the real difference between recommendations and a centrally validated
> standard".  It was that same observation, but coming from the point of view
> of our status quo, that has been a key part of the motivation as Steven and
> I have been thinking about what our version 1.0 standard should look like.
>
> In version 0.4 we have a centrally validated and mandated standard, but it
> has built-in optionality.  For instance, it is our standard to use SIL and
> Linguist codes to identify languages precisely, but data providers also
> have the option of just providing free text.  Thus the standard is
> currently not requiring language codes but only recommending them as best
> practice, and an examination of the harvested records from our 20 or so
> participating data providers reveals the fact that many sites are not now
> using codes.
>
> Our proposal to take the controlled vocabularies out of the standard and to
> treat them as best practice recommendations thus does not really change the
> current reality.  In fact, it probably gives a better reflection of the
> reality. One key advantage from the point of view of managing the
> infrastructure is that it will not be necessary to change the standard when
> controlled vocabularies are changed or added.  The metadata standard would
> just specify the structure of the container record and the mechanism for
> defining metadata extensions and would be very static.  Each controlled
> vocabulary would be managed separately in an independent document and in a
> formal extension definition that would supply downloadable code sets so
> that extension data can still be centrally validated.  When the community
> reaches a consensus that a particular vocabulary should be used when
> applicable, then it would become a community Recommendation and our default
> harvester would support it. Service providers would exploit it (such as
> Linguist is now doing with searching by language) and that would show data
> providers who are not yet using the vocabulary the benefits of using it.
> We could even have a "Recommended practice report card" that would show
> which recommended extensions an archive is using and which it is not.
>
> Thus Steven and I are assuming that the end result of this change would not
> weaken compliance to standardized vocabularies (which is already optional),
> but that it would make it much easier to manage changes to vocabularies and
> to experiment with specialized vocabularies.
>
> I hope that helps to clarify where we are coming from.
>
> -Gary Simons
>
>
>
>
>
>                       Helen Dry <hdry at LINGUISTLIST.ORG>
>                       Sent by: OLAC Implementers List            To:
>                       <OLAC-IMPLEMENTERS at LISTSERV.LINGUI         OLAC-IMPLEMENTERS at LISTSERV.LINGU
ISTLIST
>                       STLIST.ORG>                                .ORG
>                                                                  cc:
>                                                                  Subject: Re: A simpler format fo
r OLAC
>                       09/26/02 06:36 PM                          vocabularies and schemes
>                       Please respond to Open Language
>                       Archives Community Implementers
>                       List
>
>
>
>
>
> Hi, Gary (and everyone),
>
> I've just sent a long posting to the list explaining some of my problems
> with Steven's
> & Gary's proposal, so all I want to do here is respond briefly. I
> completely agree with
> your point about the value of syntactic simplification.  But I'm not sure
> about the
> second point--reducing all OLAC vocabularies to recommendations.  It's
> interesting
> where our opinions diverge--i.e., you see the benefits to the archive,
> which may
> already have a user-defined scheme, and I see the possible problems for the
> general service provider, which may not be able to handle multiple
> user-defined
> schemes in an efficient way.  Perhaps OLAC can handle this problem by
> making
> STRONG recommendations . . . but in that case, I don't see the real
> difference
> between recommendations and a centrally validated standard . . . except for
> the fact
> that OLAC wouldn't have to re-publish all the metadata whenever a
> recommendation changed.  I suppose this would be an administrative
> advantage--
> but enough of a one to lose the potential benefits of standardization???
> I'm waiting
> to be convinced....
>
> -Helen
>
>
>
> On 24 Sep 2002 at 10:07, Gary Holton wrote:
>
> On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
> wrote:
> >--
> >
> >So, what do you think?  Do you agree with our proposals for
> >(i) a syntactic simplification in our XML representation, and
> >(ii) switching OLAC vocabularies from being centrally validated
> >standards to recommendations?  We would welcome your feedback.
> >
>
>
> Dear Steven & Gary,
>
> I haven't had much time to digest your proposal, but my initial reaction is
> very positive. Regarding (i), it is clear that a syntactic simplification
> is needed. I for one have never been able to keep straight refinements vs.
> schemes, and I don't think I'm alone here. And as you point out (ii), the
> real issue should be not whether a particular refinement (and associated
> vocabulary) has been officially adopted (mandated?), but rather whether a
> such a refinement is useful to the community. We can debate ontologies, but
> it is more difficult to debate usefulness without actually implementing a
> refinement. Your proposal would permit refinements ("extensions") to fit
> the needs of the community, so that useful solutions could evolve.
>
> I have often approached the metadata issue by trying to imagine what types
> of refinements and vocabularies would be useful to the end user. The
> difficulty is that we don't know enough about how the user will be
> searching, what they will be searching for, and what types of search
> facilities they will have. The best we can do at this point is make an
> educated guess and then watch closely to see how the refinements and
> vocabularies are actually used. That said, I think we have some very good
> guesses already and will certainly be able to recommend best practices by
> December. However, if we lock in the vocabularies then most archives will
> continue to have to support both an OLAC schema and a user-defined schema
> (as you point out). This would essentially remove the data provider from
> the loop, in that user-defined schemas would be viewed as idiosyncratic and
> non-standard. Allowing user-defined "extensions" would encourage innovation
> on the part of both data and service providers--innovation mediated by the
> end user.
>
> Any reactions from others?
>
> Gary Holton


From baden at COMPULING.NET  Mon Sep 16 13:15:50 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 16 Sep 2002 23:15:50 +1000
Subject: query about format.sourcecode
Message-ID: <MON.16.SEP.2002.231550.1000.>

Hi -

I've got a query about matters related to the element format.sourcecode

Currently the spec at http://www.language-archives.org/OLAC/olacms.html
assumes that software resources indexed by OLAC will be in source code
(and hence appropriate entries will be made under this tagset).

The recommendation is currently:

<format.sourcecode
code="PROGRAMMING_LANGUAGE">Comments</format.sourcecode>

There are several questions I have about this.

1) Do we need to clarify this even further as there are apparently two
distinct options from the archive contents I've been working with). One
is where the sourcecode requires compilation, the other is where
sourcecode is essentially a script (or series of scripts). Any
information about the "state" of the source code is likely to be
inconsistent at best across archives, and I suspect even within a single
archive. IMHO its relatively important to the end user of the OLAC
search engine as to what state the sourcecode is in (ie how applicable
is this code to the platforms I have access to).

2) In the case where software resources indexed by OLAC are distributed
in compiled form (ie not sourcecode) there's apparently not much more
room to encode this information either. Apart from not strictly being
something which belongs in a format.sourcecode element, the
recommendation I assume would be that you could standardise this again
by using the comment field, but the same consistency problem arises.
Again, IMHO its relatively important to the end user of the OLAC search
engine as to what state the sourcecode is in (ie can I just install and
run or is it more complex)

These two points may not represent large issues, but if the archives you
are dealing with have a lot of software which ranges from source scripts
in a range of languages, source for compilation for a range of
compilers, and compiled "ready to run" applications, the granularity of
this markup can be important and greatly assist with classification and
indexation of resources in an appropriate manner. Additionally, for the
less computer literate end users, this distinction is very important in
them effectively locating a resource which is appropriate to their
needs.

Baden


From sb at UNAGI.CIS.UPENN.EDU  Mon Sep 16 21:39:54 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 16 Sep 2002 17:39:54 EDT
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <MON.16.SEP.2002.173954.EDT.SB@LDC.UPENN.EDU>

The OLAC metadata format provides two mechanisms for community-
specific resource description.  First, special refinements (metadata
elements and corresponding vocabularies) support compatible
description across the community.  For example, the subject.language
element, and the OLAC-Language vocabulary, permit all archives to
identify subject language in the same manner.  Second, every OLAC
element permits an optional scheme attribute for use by
sub-communities of OLAC.  For example, the scholars at Academia Sinica
can use their own naming scheme for Formosan languages and still
package it up using the OLAC metadata container.  This combination of
standard refinements and user-defined schemes seems to offer a
reasonable balance between interoperability and extensibility.

Over the past month, Gary and I have been reviewing the design of OLAC
metadata and have concluded that these parallel mechanisms are
unnecessary.  We think that with a *single* extension mechanism, OLAC
can provide even better interoperability and extensibility.  Moreover,
we think this can be done with less administrative and technical
infrastructure than before, making it still easier for archives to
participate in OLAC.


A. THE PRESENT SITUATION

We begin with a quick review of how the two existing mechanisms work
in OLAC metadata.  First, community-specific refinements are
represented using Dublin Core qualifications represented in XML.  Here
is an example for subject language:

  A resource about the Sikaiana language:
  <subject.language code="x-sil-SKY"/>

This refinement permits focussed searching and better precision/recall
than the corresponding Dublin Core element:

  <subject>The Sikaiana Language</subject>

The OLAC version is flexible in that the code attribute is optional
and that free-text can be put in the element content.

The second mechanism is for user-defined schemes.  All OLAC elements
permit a scheme attribute, naming some third-party format or
vocabulary that one or more OLAC archives use.  For instance, the
language listed by Ethnologue as Taroko (TRV) is known as Seediq in
Academia Sinica, and OLAC would permit either or both of the following
elements to appear in a metadata record for this language:

  <subject.language code="x-sil-TRV"/>
  <subject.language scheme="AS-Formosan">Seediq</subject.language>

Such a resource would be discovered under either naming scheme, and
Academia Sinica could provide end-user services that rewarded any archive
which employed its scheme for Formosan language identification.


B. PROBLEMS WITH THE PRESENT SITUATION

There are four general problems with the present situation.

1. Finalizing standard refinements.  Our track record at developing
controlled vocabularies over the past year indicates that we are not
going to be able to finalize all the vocabularies that the OLAC
metadata standard specifies in time for launching version 1.0 after
our December workshop.  Even if some vocabularies are finalized by
December, the discussion may be reopened any time a new kind of
archive joins OLAC.  However, each vocabulary revision must currently
be released as a new version of the entire OLAC metadata set, an
unacceptable bureaucratic obstacle.

2. The artificial distinction between refinements and schemes.  It is
not clear when a putative refinement is important enough to be adopted
as an OLAC standard, versus a user-defined scheme.  Some of the
refinements we recognize at present aren't as germane to the overall
enterprise as others (e.g. operating system vs subject language), and
may not have enough support to be retained.  Conversely, the community
is sure to develop new, useful ontologies that we don't support at
present, and we would need to change the OLAC metadata standard in
order to accommodate them.  Promoting a user-defined scheme to an OLAC
standard would necessitate a change in the XML representation, generating
unnecessary work for all archives that support the scheme.

3. Duplication of technical support.  User-defined schemes are likely
to involve controlled vocabularies, with the same needs as OLAC
vocabularies with respect to validation, translation to
human-readable form in service providers, and dumb-down to Dublin Core
for OAI interoperability.  At present, the necessary infrastructure
must be created twice over, once for each of the two mechanisms.

4. Idiosyncracies of XML schema.  XML schema is used to define the
well-formedness of OLAC records, but it is unable to express
co-occurrence constraints between attribute values.  This means that
we cannot have more than one vocabulary for an element, forcing us to
build structure into element names and multiply the names
(e.g. Format.markup, Format.cpu, Format.os, ...).  It is unfortunate
that such a fundamental aspect of the OLAC XML format depends on a
shortcoming of a tool that we may not be using for very long.

In sum, the current model will be difficult to manage over the long
term.  Administratively, it encourages us to seek premature closure on
issues of content description that can never be closed.  Technically,
it forces us to release new versions of the metadata format with each
vocabulary revision, and forces us to create software infrastructure to
support a mishmash of four syntactic extensions of DC:

   <element.EXT1 refine="EXT2" code="EXT3" scheme="EXT4">


C. A NEW APPROACH

In response to the problems outlined above, we would like to propose a
new approach.  The basic idea is simple: express all refinements,
vocabularies and schemes using a uniform DC extension mechanism, and
treat them all as recommendations instead of centrally-validated
standards.  The extension mechanism requires two attributes, called
"extension" and "code", as shown below:

  <subject extension="OLAC-Language" code="x-sil-SKY"/>
  <subject extension="AS-Formosan" code="Seediq"/>

It would be syntactically valid to simply use an extension in metadata
without defining it. However, for extensions that will be used across
the community, there must also be a formal definition that enumerates
the corresponding controlled vocabulary in such a way that data
providers and service providers alike can harvest the vocabulary from
its definitive source. Thus another aspect of the new approach is an
XML schema for the formal definition of an XDC extension. In the
description section of the OAI Identify response, a data provider
would declare which formally defined extensions it employs in its metadata.

Extensions that enjoyed broad community support would be identified as
OLAC Recommendations (following the existing OLAC Process).  All OLAC
archives would be encouraged to adopt them, in the sense that OLAC
service providers would permit end-users to perform focussed searches
over these extensions.  In this way, archives that cooperate with the
rest of the community are rewarded.

Note that the approach isn't specific to language archives, so we're
calling it extensible Dublin Core (XDC).  An example of the syntax
is available (an XML DTD, the equivalent XML schema, and an instance
document): http://www.language-archives.org/XDC/0.1/


D. BENEFITS

The new approach is technically simpler than the existing approach,
and neatly solves the four problems we reported.

1. Finalizing standard refinements.  The editors of OLAC vocabulary
documents would be empowered to edit the vocabulary into the future,
without concern for integration with new releases of the OLAC metadata
format.

2. The artificial distinction between refinements and schemes.  The
syntactic distinction is gone, being replaced by a semantic one: is
the vocabulary an OLAC Recommendation or not?  Any archive or group of
archives would be free to start using their own extensions without any
formal registration.  They could build a service to demonstrate the
merit of their extension, thereby encouraging other archives to adopt
it.  Once broad support had been established, they could build a case
for an OLAC Recommendation, leading to adoption across the community.

3. Duplication of technical support.  With the single extension
mechanism, we can provide uniform technical support for validation,
translation and dumb-down.

4. Idiosyncracies of XML schema.  We no longer give XML schema such
sway in determining our XML syntax.  Other XML and database technologies
will be used to test that an extension is used correctly.

In sum, the new approach is extensible, requiring no central
administration of extensions, and no coordination of vocabulary
revisions with new releases of the metadata format.  The new approach
also supports interoperability across the whole OLAC community (via
OLAC Recommendations) and also among OLAC sub-communities that want to
create their own special-purpose extensions.


E. IMPLICATIONS

We are still working out the technical implications for OLAC central
services (e.g. registration, Vida, ORE, etc), and we will only be able
to implement parts of this in time for the December meeting.  As
always, we would welcome donations of programmer time to help us.

The short-term implication for OLAC archives is completely trivial,
since only a simple syntactic change is required.

The most important implication of this change is that it reduces the
pressure to reach final agreement on OLAC vocabularies by our December
workshop.  But this isn't an excuse for us to slow down on that front.
On the contrary, it frees us up to find working solutions for the key
vocabularies that define us as a community.  These will always be
imperfect compromises that we can agree to work with and revise as
necessary, well into the future.

In sum, we hope we are not opening up a technical can of worms, but
facilitating progress on the substantive issues, our common
descriptive ontologies.  Therefore, we encourage people to identify a
particular extension that they would like to work on, and post their
ideas and questions to this list (as Baden Hughes has just now done
for sourcecode).  You may also like to present your ideas at our
workshop in December...

--

So, what do you think?  Do you agree with our proposals for
(i) a syntactic simplification in our XML representation, and
(ii) switching OLAC vocabularies from being centrally validated
standards to recommendations?  We would welcome your feedback.

Steven Bird & Gary Simons


From sb at UNAGI.CIS.UPENN.EDU  Mon Sep 16 22:13:15 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 16 Sep 2002 18:13:15 EDT
Subject: query about format.sourcecode
In-Reply-To: Your mail dated Monday 16 September, 2002.
Message-ID: <MON.16.SEP.2002.181315.EDT.SB@LDC.UPENN.EDU>

Baden Hughes <baden at compuling.net> wrote:
> I've got a query about matters related to the element format.sourcecode

Its good to see discussion of software resources for a change, and I hope
the maintainers of software archives (DFKI, TRACTOR) will contribute to
this discussion.

> Currently the spec at http://www.language-archives.org/OLAC/olacms.html
> assumes that software resources indexed by OLAC will be in source code
> (and hence appropriate entries will be made under this tagset).

Not quite - all OLAC elements are optional, and some elements are simply
inappropriate for some resources.  Software distributed in binary form only
doesn't need to be given any sourcecode descriptor.

> The recommendation is currently:
>
> <format.sourcecode
> code="PROGRAMMING_LANGUAGE">Comments</format.sourcecode>
>
> There are several questions I have about this.
>
> 1) Do we need to clarify this even further as there are apparently two
> distinct options from the archive contents I've been working with). One
> is where the sourcecode requires compilation, the other is where
> sourcecode is essentially a script (or series of scripts). Any
> information about the "state" of the source code is likely to be
> inconsistent at best across archives, and I suspect even within a single
> archive. IMHO its relatively important to the end user of the OLAC
> search engine as to what state the sourcecode is in (ie how applicable
> is this code to the platforms I have access to).

Good, so the end-user requirement here is to be able to answer the
question: "Can I run this software?"

> 2) In the case where software resources indexed by OLAC are distributed
> in compiled form (ie not sourcecode) there's apparently not much more
> room to encode this information either. Apart from not strictly being
> something which belongs in a format.sourcecode element, the
> recommendation I assume would be that you could standardise this again
> by using the comment field, but the same consistency problem arises.
> Again, IMHO its relatively important to the end user of the OLAC search
> engine as to what state the sourcecode is in (ie can I just install and
> run or is it more complex)

Right, so the end-user requirement here is to be able to answer the
question: "How much effort will be required to get this running?"

> These two points may not represent large issues, but if the archives you
> are dealing with have a lot of software which ranges from source scripts
> in a range of languages, source for compilation for a range of
> compilers, and compiled "ready to run" applications, the granularity of
> this markup can be important and greatly assist with classification and
> indexation of resources in an appropriate manner. Additionally, for the
> less computer literate end users, this distinction is very important in
> them effectively locating a resource which is appropriate to their
> needs.

Absolutely.  Currently we have vocabularies for Sourcecode, CPU, and OS.
However, we can modify of scrap them if they don't serve our needs for
resource description and discovery.  Perhaps we need a new vocabulary
that better describes the state of the sourcecode.

One way to proceed here is for Baden (and any others) to identify the full
range of end-user requirements (is it more than these two?) then propose
vocabularies that best serve these requirements...

-Steven

--
Steven.Bird at ldc.upenn.edu  http://www.ldc.upenn.edu/sb
Assoc Director, LDC; Adj Assoc Prof, CIS & Linguistics
Linguistic Data Consortium, University of Pennsylvania
3600 Market St, Suite 810, Philadelphia, PA 19104-2653


From baden at COMPULING.NET  Fri Sep 20 11:57:22 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 20 Sep 2002 21:57:22 +1000
Subject: proposed revision of format.os
Message-ID: <FRI.20.SEP.2002.215722.1000.>

In working with several archives and drawing on other IT experience, I'd
like to make some proposed changes to the format.os schema.

---
<?xml version="1.0" encoding="utf-8" ?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.language-archives.org/OLAC/0.4/">
<annotation>
<documentation>1.0 OLAC Schema for operating system types, Steven Bird,
4/27/01 1.1 draft OLAC Schema for operating system types, Baden Hughes,
19/09/02</documentation>
</annotation>
<simpleType name="OLAC-OS-Code">
  <restriction base="string">
  <enumeration value="Unix" />
  <enumeration value="Unix/Linux" />
  <enumeration value="Unix/Solaris" />
  <enumeration value="Unix/SunOS" />
  <enumeration value="Unix/SCO" />
  <enumeration value="Unix/AIX" />
  <enumeration value="Unix/BSD" />
  <enumeration value="Unix/FreeBSD" />
  <enumeration value="Unix/OpenBSD" />
  <enumeration value="Unix/NetBSD" />
  <enumeration value="Unix/DECAlpha" />
  <enumeration value="Unix/GNU-Hurd" />
  <enumeration value="Unix/HPBLS" />
  <enumeration value="Unix/HPUX" />
  <enumeration value="Unix/IRIX" />
  <enumeration value="Unix/AIX" />
  <enumeration value="Unix/UnixWare" />
  <enumeration value="Unix/Xenix" />
  <enumeration value="Unix/VMS" />
  <enumeration value="SonyClieOS" />
  <enumeration value="Amiga" />
  <enumeration value="PalmOS" />
  <enumeration value="BeOS" />
  <enumeration value="NextSTEP" />
  <enumeration value="MacOS" />
  <enumeration value="MacOS/OSX" />
  <enumeration value="OS2" />
  <enumeration value="MSDOS" />
  <enumeration value="4DOS" />
  <enumeration value="MSWindows" />
  <enumeration value="MSWindows/win31" />
  <enumeration value="MSWindows/win95" />
  <enumeration value="MSWindows/winNT" />
  <enumeration value="MSWindows/win98" />
  <enumeration value="MSWindows/win2k" />
  <enumeration value="MSWindows/winCE" />
  <enumeration value="MSWindows/winME" />
  <enumeration value="MSWindows/winXP" />
  <enumeration value="MSWindows/PocketPC" />
  <enumeration value="MSWindows/PocketPC2002" />
  <enumeration value="MSWindows/.NET" />
  </restriction>
  </simpleType>
  </schema>
---

You can also find this draft schema at
http://www.compuling.net/projects/olac/190902-draft-olac-format.os.xsd

These changes essentially add to the list if possible operating systems
that I've encountered in classifying software.

If preferred, I can circulate to the list. If there's others interested
in working on this document, I'm more than happy to collaborate.

Baden


From baden at COMPULING.NET  Fri Sep 20 12:15:53 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 20 Sep 2002 22:15:53 +1000
Subject: proposed revision of format.cpu
Message-ID: <FRI.20.SEP.2002.221553.1000.>

In working with several archives and drawing on other IT experience, I'd
like to make some proposed changes to the format.cpu schema, (without
regurgitating the entire history of computing in the process :-).

---
<?xml version="1.0" encoding="utf-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.language-archives.org/OLAC/0.4/">

  <annotation>
    <documentation>
      1.0 OLAC Schema for CPUs, Steven Bird, 5/7/01
      1.1 draft OLAC Schema for CPU, Baden Hughes, 19/09/02
    </documentation>
  </annotation>

  <simpleType name="OLAC-CPU-Code">
    <restriction base="string">
      <enumeration value="x86"/>
      <enumeration value="MIPS"/>
      <enumeration value="Alpha"/>
      <enumeration value="Sparc"/>
      <enumeration value="680x0"/>
      <enumeration value="PA-RISC"/>
      <enumeration value="ARM"/>
      <enumeration value="ARM32"/>
      <enumeration value="Itanium"/>
      <enumeration value="IBM System 360/370/390"/>
      <enumeration value="Clipper"/>
      <enumeration value="i370"/>
      <enumeration value="i860"/>
      <enumeration value="i960"/>
      <enumeration value="Power4"/>
      <enumeration value="Cray"/>
      <enumeration value="m68k"/>
      <enumeration value="m88k"/>
      <enumeration value="ns32k"/>
      <enumeration value="IBM rs6000"/>
      <enumeration value="IBM rs10000"/>
      <enumeration value="s390"/>
      <enumeration value="VAX"/>
      <enumeration value="PowerPC"/>
      <enumeration value="DragonBall"/>
      <enumeration value="StrongARM"/>
      <enumeration value="Intel 8088"/>
      <enumeration value="Intel 8080"/>
      <enumeration value="Intel 8008"/>
      <enumeration value="Intel IA-64"/>
      <enumeration value="Motorola 6502"/>
      <enumeration value="Motorola 6800"/>
      <enumeration value="Motorola ColdFire"/>
      <enumeration value="Zilog Z80"/>
      <enumeration value="Transmeta Caruso"/>
      <enumeration value="SH3"/>
      <enumeration value="x86_64"/>


    </restriction>
  </simpleType>

</schema>
---

You can also find this draft schema at
http://www.compuling.net/projects/olac/190902-draft-olac-format.cpu.xsd

These changes essentially add to the list if possible operating systems
that I've encountered in classifying cpu architectures relevant to
language software. This includes some older mid-range style
architectures and the latest handheld architectures.

If preferred, I can circulate to the list. If there's others interested
in working on this document, again I'm more than happy to collaborate.

Baden


From baden at COMPULING.NET  Mon Sep 23 02:06:04 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 12:06:04 +1000
Subject: fproposed revision of format.sourcecode
Message-ID: <MON.23.SEP.2002.120604.1000.>

After a survey of several language archives, I'd like to propose some
possible changes to the format.sourceode schema. Essentially this list
is a list of programming languages of various types, in which software
may be written. This list includes those found at:
http://www.hypernews.org/HyperNews/get/computing/lang-list.html

A draft can be found online at:
http://www.compuling.net/projects/olac/220902-draft-olac-format.sourceco
de.xsd

Comments welcome.

Baden


From sb at UNAGI.CIS.UPENN.EDU  Mon Sep 23 06:38:54 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 23 Sep 2002 02:38:54 EDT
Subject: fproposed revision of format.sourcecode
In-Reply-To: Your mail dated Monday 23 September, 2002.
Message-ID: <MON.23.SEP.2002.023854.EDT.SB@LDC.UPENN.EDU>

Baden Hughes <baden at compuling.net> wrote:
> After a survey of several language archives, I'd like to propose some
> possible changes to the format.sourceode schema. Essentially this list
> is a list of programming languages of various types, in which software
> may be written. This list includes those found at:
> http://www.hypernews.org/HyperNews/get/computing/lang-list.html
>
> A draft can be found online at:
> http://www.compuling.net/projects/olac/220902-draft-olac-format.sourcecode.xsd
>
> Comments welcome.

This is great - a 20-fold increase on the number listed in my original 0.4
list.  I grepped for a few obscure languages and they were all there.

I'd like to raise two low-level technical issues, capitalization and
whitespace.

First, 99% of the codes are all-caps, even though some programming language
names are not written like this (e.g. the list gives "PROLOG" but it should
really be "Prolog").  However, rather than having to settle disputes about
this question, I'd prefer it if we case-normalized everything.  What do
people think - should we standardize on uppercase?

Second, Baden's list includes many items with spaces, e.g. "OBJECTIVE
CAML".  However, it seems desirable to limit the range of characters that
can appear in a controlled vocabulary item (e.g. no accents) so that there
is no transmission problems etc.  In some contexts, such as hand-crafted
CGI Get requests and HTML anchors, it is a pain to have to manually escape
the space character.  Could we live with a restriction of no spaces -
i.e. replacing spaces with underscore?

** Note that neither of these issues is substantive, since each controlled
vocabulary item will be associated with a human readable form (including
translations into other languages).  For example, in Dublin Core, there is
a refinement named "hasVersion" with the human-readable label "Has
Version".  [http://www.dublincore.org/documents/dcmes-qualifiers/].
The plan is to do the same thing for OLAC vocabularies.

-Steven


From baden at COMPULING.NET  Mon Sep 23 07:00:36 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 17:00:36 +1000
Subject: fproposed revision of format.sourcecode
In-Reply-To: <200209230639.g8N6csL10762@unagi.cis.upenn.edu>
Message-ID: <MON.23.SEP.2002.170036.1000.>

I've updated the format.sourcecode schema draft with:

-unnecessary whitespace removed
-whitespace normalized to underscores in enumeration values
-typos corrected

You can find the updated list here:

http://www.compuling.net/projects/olac/230902-draft-olac-format.sourceco
de.xsd

There's currently 285 programming languages listed on this schema. If
any one has any more to add, drop me an email.

Regards

Baden

> -----Original Message-----
> From: Steven Bird [mailto:sb at unagi.cis.upenn.edu]
> Sent: Monday, 23 September 2002 16:39
> To: baden at compuling.net
> Cc: OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG
> Subject: Re: fproposed revision of format.sourcecode
>
>
>
> Baden Hughes <baden at compuling.net> wrote:
> > After a survey of several language archives, I'd like to
> propose some
> > possible changes to the format.sourceode schema.
> Essentially this list
> > is a list of programming languages of various types, in
> which software
> > may be written. This list includes those found at:
> > http://www.hypernews.org/HyperNews/get/computing/lang-list.html
> >
> > A draft can be found online at:
> >
> http://www.compuling.net/projects/olac/220902->
draft-olac-format.source
> > code.xsd
> >
> > Comments welcome.
>
> This is great - a 20-fold increase on the number listed in my
> original 0.4 list.  I grepped for a few obscure languages and
> they were all there.
>
> I'd like to raise two low-level technical issues,
> capitalization and whitespace.
>
> First, 99% of the codes are all-caps, even though some
> programming language names are not written like this (e.g.
> the list gives "PROLOG" but it should really be "Prolog").
> However, rather than having to settle disputes about this
> question, I'd prefer it if we case-normalized everything.
> What do people think - should we standardize on uppercase?
>
> Second, Baden's list includes many items with spaces, e.g.
> "OBJECTIVE CAML".  However, it seems desirable to limit the
> range of characters that can appear in a controlled
> vocabulary item (e.g. no accents) so that there is no
> transmission problems etc.  In some contexts, such as
> hand-crafted CGI Get requests and HTML anchors, it is a pain
> to have to manually escape the space character.  Could we
> live with a restriction of no spaces - i.e. replacing spaces
> with underscore?
>
> ** Note that neither of these issues is substantive, since
> each controlled vocabulary item will be associated with a
> human readable form (including translations into other
> languages).  For example, in Dublin Core, there is a
> refinement named "hasVersion" with the human-readable label
> "Has Version".
> [http://www.dublincore.org/documents/dcmes-> qualifiers/].
> The
> plan is to do the same thing for OLAC vocabularies.
>
> -Steven
>


From ruyng at GATE.SINICA.EDU.TW  Mon Sep 23 10:29:42 2002
From: ruyng at GATE.SINICA.EDU.TW (Ru-Yng Chang)
Date: Mon, 23 Sep 2002 06:29:42 -0400
Subject: fproposed revision of format.sourcecode
Message-ID: <MON.23.SEP.2002.062942.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Dear all,

I find the difference between the draft and
the code for program language of the standard of Chinese  catalogue from
National Central Library.
http://datas.ncl.edu.tw/catweb/2-1-2a.htm(Big-5 encoding.)

As the list.


---A-----------
ADAPTIVE SERVER ENTERPRISE
ADS-C
AL
ALPHARD
ANALITIK
ANNA
APL2


---B-----------
BCY/B


---C-----------
CADL
CALM
CANDE
CCL
CIP-L
CLIPPER
COLTS
COMSKEE
CONCURRENT_EUCLID


---D-----------
D.L.LOGO
DATAPLOT
DBL
DIST
DYNAMO


---E-----------
EDISON
ELAN


---F-----------
FOCUS
FRED


---G-----------
GHC
GLYPNIR


---H-----------
HYPERTALK


---I-----------
IDL
INFORMIX-4GL
INTERPRESS
ISETL
ISP


---J-----------
JAVA
JAVA_APPLET (INCLUED IN JAVA)
JAVA_WORKSHOP (INCLUED IN JAVA)
JOSEF


---K-----------
KHUWARIZMI
KYLIX


---L-----------
LISP
LOGLAN_82
LOGO
LOTUS_SCRIPT
LUCID


---M-----------
MACRO-11
MFC
MODULA-2
MOUSE


---M-----------
NATAL
NPL

---O-----------
OCCAM2
OPS5

---P-----------
PARAGON
PARLOG
PILOT
PLEASE
PL/1
PL/M51
PL/SQL
POP11
PORTAL
PSEUDOCODE
PUCMAT

---Q-----------
QEDIT


---R-----------
ROSS


---S-----------
S-ALGOL
SGML
SHELL
SIMNET
SMAL/80
SNAP
SNOBOL
SPECOL
SPITBOL
SQL/ORACLE
STAROFFICE
STEP_3
STEP_5
SURVIS


---T-----------
T
TIME_SERIES_PROCESSOR
TURBO
TUTOR


---U-----------
UCSD_PASCAL
UNIGRAPHICS
UNISON_AUTHOR_LANGUAGE


Ru-Yng


From baden at COMPULING.NET  Mon Sep 23 13:28:23 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 23:28:23 +1000
Subject: proposed revision of format.sourcecode
In-Reply-To: <OLAC-IMPLEMENTERS%2002092306394299@LISTSERV.LINGUISTLIST.ORG>
Message-ID: <MON.23.SEP.2002.232823.1000.>

An updated version of the format.sourcecode schema is now available
online with additions from Ru-Yng Chang.

http://www.compuling.net/projects/olac/240902-draft-olac-format.sourceco
de.xsd

Regards

Baden


From gary.holton at UAF.EDU  Tue Sep 24 14:07:00 2002
From: gary.holton at UAF.EDU (Gary Holton)
Date: Tue, 24 Sep 2002 10:07:00 -0400
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <TUE.24.SEP.2002.100700.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
wrote:
>--
>
>So, what do you think?  Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations?  We would welcome your feedback.
>


Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is
very positive. Regarding (i), it is clear that a syntactic simplification
is needed. I for one have never been able to keep straight refinements vs.
schemes, and I don't think I'm alone here. And as you point out (ii), the
real issue should be not whether a particular refinement (and associated
vocabulary) has been officially adopted (mandated?), but rather whether a
such a refinement is useful to the community. We can debate ontologies, but
it is more difficult to debate usefulness without actually implementing a
refinement. Your proposal would permit refinements ("extensions") to fit
the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types
of refinements and vocabularies would be useful to the end user. The
difficulty is that we don't know enough about how the user will be
searching, what they will be searching for, and what types of search
facilities they will have. The best we can do at this point is make an
educated guess and then watch closely to see how the refinements and
vocabularies are actually used. That said, I think we have some very good
guesses already and will certainly be able to recommend best practices by
December. However, if we lock in the vocabularies then most archives will
continue to have to support both an OLAC schema and a user-defined schema
(as you point out). This would essentially remove the data provider from
the loop, in that user-defined schemas would be viewed as idiosyncratic and
non-standard. Allowing user-defined "extensions" would encourage innovation
on the part of both data and service providers--innovation mediated by the
end user.

Any reactions from others?

Gary Holton


From sb at UNAGI.CIS.UPENN.EDU  Tue Sep 24 22:25:11 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Tue, 24 Sep 2002 18:25:11 EDT
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <TUE.24.SEP.2002.182511.EDT.SB@LDC.UPENN.EDU>

Thanks for the positive feedback.  While we await more reactions let me
jump in and say that Gary and I are working on a revised version of
the proposal to bring it into line with new developments in the Dublin Core
Metadata Initiative (DCMI).  We'll preserve the new extensibility that
people seem to appreciate, but also make syntactic changes to maximize
interoperability with the wider digital libraries community.

In the past we've basically gone it alone in working out how to represent
our own DC qualifications in XML.  However, the timing of these
recommendations and our forthcoming workshop present us with a new
opportunity to standardize our implementation.

If you'd like to learn more about what's happening in DCMI with qualifiers
and XML, please see the following article and the material it cites:

  Recommendations for XML Schema for Qualified Dublin Core
  Proposal to DC Architecture Working Group
  http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/

Next week we'll circulate a proposal for how OLAC can conform with this.

Note that this is only about XML implementation and not OLAC content.
For those who only care about disseminating metadata, conformance with the
DCMI recommendations will ensure maximal interoperability with the wider
digital libraries community, so that your metadata pops up all over
cyberspace.


Back on the subject of extensibility...  The key innovation in our recent
proposal, that we'd still like more feedback on, is for the OLAC
vocabularies to be changed from being centrally enforced standards to
recommended practices.  Under this model, any archive will be able to adopt
and promulgate its favorite ontologies, while the OLAC Process is still
used to identify community-agreed best practices that everyone should
follow.

For instance, consider the sourcecode vocabulary, which is only relevant to
the software archives and which may need constant updates.  Under the
proposed model, the vocabulary wouldn't actually need to reside on the OLAC
site; it could live wherever it could be easily maintained.  However, the
OLAC site would host the details of any associated working group, so that
others could discover the group and contribute to the revision of the
vocabulary.  It would also host any associated OLAC recommendation, so that
everyone would know that the OLAC community had adopted a certain
vocabulary as best practice.

-Steven


From jcgood at SOCRATES.BERKELEY.EDU  Tue Sep 24 23:23:50 2002
From: jcgood at SOCRATES.BERKELEY.EDU (Jeff Good)
Date: Tue, 24 Sep 2002 16:23:50 -0700
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID: <TUE.24.SEP.2002.162350.0700.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Hello,

I wanted to say that I think the basic designs of the revisions proposed
by Steven and Gary are very good suggestions. I completely agree with
Gary Holton's points--so I won't repeat them.

I thought I'd point out how I think these revisions can be usefully
applied to some problems that the working group evaluating the
linguistic types document. I think this new format will allow us to get
past many issues which I thought may have been intractable. I guess I
consider this to be a good "empirical" test of the proposal.

The specific problem was that there are many cross-cutting ways to
classify the "type" of a linguistic document. There's a sense in which a
document focuses on a big sub-field of linguistics like phonology,
morphology, etc. There's the basic structure of a document: dictionary,
grammar, text (the term "macrostructure" can be used to describe this
category). And then there are important "meso/micro-structure" aspects
of documents---like the type of transcription used (free translation,
interlinear, etc.)

The original OLAC system encouraged us to create an ontology of document
types which assumed that there was one "type" for a document, when, in
reality, type is a multi-dimensional concept. As we realized this, we
started to break down the types into the most important dimensions--like
linguistic subject, basic structure, etc. But even then, there were
problems of classification. For example, categories like "oratory",
"narrative", "ludic" seemed appropriate for some linguistic
documents--but it isn't immediately clear where they belong in a
hierarchy of types (are they structural or content types? or are they
something else?).

It was possible to create a system of types which works, but I think
many of our conceptual and implementational problems can be more cleanly
solved by the new systems because of it extensibility.

Specifically, rather than having to pigeonhole types into a few
categories in a hierarchy, we can just propose a series of vocabularies
corresponding to the potentially independent "type" parameters of a
document--for example, a linguistic subject vocabulary, a document
structural type vocabulary, a "discourse"-type vocabulary for things
like "oratory" and "narrative". (For more detail on this, there are
relevant recent posts, one from me, on the Metadata list.)

Over time, I'm sure we'll find some of the vocabularies are more
useful/used than others--and these can become OLAC recommended standard
vocabularies. I think the real value of the new system will be that it
is much more forgiving/flexible if we find we need to adapt our "type"
categories in the future.

Since Steven just posted about the idea that vocabularies be recommended
practices, I'll say that I think that aspect of the proposal is also
very helpful to working out a linguistic type vocabulary. One thing that
at least I am convinced of in the discussion of "types" is that there is
a counterexample to every generalization you can make about them. It may
be the case that some counterexamples are minor enough that we can get
away without a good classification for them. Or it might be the case
that a counterexample is revealing a set of important omissions in the
proposals. It's hard to tell without testing a lot of archives.

A recommended, but not enforced, vocabulary would address this
problem--as archivers encounter situations that aren't covered, they
wouldn't be forced to "fit" their document into a category where it
doesn't belong. This would not only promote the creation of needed new
vocabulary items but also maintain the integrity of existing ones.

Additionally, the idea of recommended vocabularies, plus a best practice
standard, certainly is more in line with the general spirit of OLAC, and
I think it would encourage more subcommunities to get involved and
create vocabularies which they need.

Jeff


From baden at COMPULING.NET  Wed Sep 25 04:31:35 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Wed, 25 Sep 2002 14:31:35 +1000
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID: <WED.25.SEP.2002.143135.1000.>

> So, what do you think?  Do you agree with our proposals for
> (i) a syntactic simplification in our XML representation, and

The syntactic revision I personally agree with. Backwards and future
compatibility is a significant factor and as such the new revisions I
believe will make it easier to implement changes community wide and
benefit archives who require special purpose extensions.

> (ii) switching OLAC vocabularies from being centrally
> validated standards to recommendations?  We would welcome
> your feedback.

The proposal for recommendations rather than mandated standards seems to
draw partially on both the W3C and IETF processes, whereby drafts or
notes are submitted, reviewed, implemented and then reviewed with the
view to standardisation if agreed as best practice. This process scales
very well, and yet allows individuals or institutions the freedom to
innovate whilst encouraging best practice once peer review of
implementations has taken place. I think this is important to encourage
innovation amongst participating archives who develop vocabularies to
address their own needs first and then promote the benefits of these for
wider community consideration.

Baden


From hdry at LINGUISTLIST.ORG  Thu Sep 26 23:15:53 2002
From: hdry at LINGUISTLIST.ORG (Helen Dry)
Date: Thu, 26 Sep 2002 19:15:53 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID: <THU.26.SEP.2002.191553.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Hi, Steven (and everyone)

Sorry to be so late responding to this proposal, but it's been a busy month.

I am a little concerned about this proposal, perhaps because I don't understand
exactly how the scheme system would work, so I thought I should make my
comments and ask a few questions.  Apologies if either or both are at a rather
elementary level--I only seem to understand DC and XML for 10 minutes, right after
I reread the websites.  :-)

It seems to me that there are two separable proposals here:  (1) collapsing the
formal mechanisms of refinement and scheme into the extension mechanism and
(2) abandoning the attempt to reach general consensus on the descriptors that
previously we were calling controlled vocabularies. The first may well be a welcome
simplification, particularly administratively. (And I seem to have heard that it's the
way the DC is going anyway.)  The second seems worrisome to me for two primary
reasons:  (1) it seems counter to the overarching OLAC (and EMELD) goal of a
unified--dare we say "standardized"?--mechanism for resource description and
retrieval within the discipline; (2) on a practical level it may complicate--perhaps to
a debilitating degree--the way that  service providers implement search facilities.

Of course, I'm thinking about LINGUIST here--we aren't an archive, so the potential
benefits of being able to DESCRIBE resources via any scheme we might devise are
not salient to me.  What I'm worried about is how we're going to offer a search
engine that makes use of all these variant descriptions.  Particularly for something
like linguistic data types--which is probably the main search field linguists will want
to use--this seems almost like a return to the bad old days of the free text field, with
the consequent loss of ability to identify and retrieve relevant resources.

Now I imagine that there is some formal mechanism for relating schemes--I know
you have a paragraph below about archives putting the schemes they use in their
identifiers.   But could you tell me exactly how this would work in practice?  E.g., at
the level of elements or terms?  Would an archive that wants to use its own scheme
have to provide a document showing how its categories relate to the categories in
all the other schemes (e.g., that its "Seediq" was SIL's "Taroko.")   Would the
service provider have to construct a search engine that would first find and correlate
all these documents, then search the multi-archive metadata for the resulting sets of
terms?  I'm sure it's possible--IF you could get everyone to provide scheme
mappings--but it certainly seems unnecessarily complex. . . and, as I said, counter
to the purpose of OLAC.  I thought we were trying to settle on a unified way to
describe linguistic resources, in order to offer the discipline the benefits of a level of
standardization.  Though this will come at the admitted expense of a certain amount
of detail and precision, I feel confident that it will be accepted (accepted for what it
is) if we persevere. After all, DC isn't perfect but people understand the utility of a
restricted set of elements.

It seems to me that, if the problem is that we may not come up with a proposal
before December, we should either redouble our efforts and make the deadline or
extend the deadline--not scrap the enterprise.  Actually, with regard to linguistic data
types, I feel confident we can come up with a reasonable proposal before the
deadline.  And I think it's important that we do so, since this is really one of the most
important vocabularies--probably the most important for a large part of our
audience, i.e. academic linguists.  It's the main way that people, as opposed to
machines, will want to search the archives.

So, in sum, I agree with the arguments for using the extension mechanism and
abandoning refinement and scheme.  But I don't see the need to abandon the goal
of reaching consensus on a single "OLAC-approved" set of linguistic data types,
however that would be modeled in a world of "extensions" (not controlled
vocabularies).  Can we use extensions but not let in the world?

BTW, under the proposal, will all the current refinements--e.g., "subject.language"
now become schemes?

But now I should stop and let someone knowledgable explain to me exactly how this
scheme system will work.  I'm all ears . . . .   :-)

Ready for enlightenment ....

-Helen


On 16 Sep 2002 at 17:39, Steven Bird wrote:

The OLAC metadata format provides two mechanisms for community-
specific resource description.  First, special refinements (metadata
elements and corresponding vocabularies) support compatible
description across the community.  For example, the subject.language
element, and the OLAC-Language vocabulary, permit all archives to
identify subject language in the same manner.  Second, every OLAC
element permits an optional scheme attribute for use by
sub-communities of OLAC.  For example, the scholars at Academia Sinica
can use their own naming scheme for Formosan languages and still
package it up using the OLAC metadata container.  This combination of
standard refinements and user-defined schemes seems to offer a
reasonable balance between interoperability and extensibility.

Over the past month, Gary and I have been reviewing the design of OLAC
metadata and have concluded that these parallel mechanisms are
unnecessary.  We think that with a *single* extension mechanism, OLAC
can provide even better interoperability and extensibility.  Moreover,
we think this can be done with less administrative and technical
infrastructure than before, making it still easier for archives to
participate in OLAC.


A. THE PRESENT SITUATION

We begin with a quick review of how the two existing mechanisms work
in OLAC metadata.  First, community-specific refinements are
represented using Dublin Core qualifications represented in XML.  Here
is an example for subject language:

  A resource about the Sikaiana language:
  <subject.language code="x-sil-SKY"/>

This refinement permits focussed searching and better precision/recall
than the corresponding Dublin Core element:

  <subject>The Sikaiana Language</subject>

The OLAC version is flexible in that the code attribute is optional
and that free-text can be put in the element content.

The second mechanism is for user-defined schemes.  All OLAC elements
permit a scheme attribute, naming some third-party format or
vocabulary that one or more OLAC archives use.  For instance, the
language listed by Ethnologue as Taroko (TRV) is known as Seediq in
Academia Sinica, and OLAC would permit either or both of the following
elements to appear in a metadata record for this language:

  <subject.language code="x-sil-TRV"/>
  <subject.language scheme="AS-Formosan">Seediq</subject.language>

Such a resource would be discovered under either naming scheme, and
Academia Sinica could provide end-user services that rewarded any archive
which employed its scheme for Formosan language identification.


B. PROBLEMS WITH THE PRESENT SITUATION

There are four general problems with the present situation.

1. Finalizing standard refinements.  Our track record at developing
controlled vocabularies over the past year indicates that we are not
going to be able to finalize all the vocabularies that the OLAC
metadata standard specifies in time for launching version 1.0 after
our December workshop.  Even if some vocabularies are finalized by
December, the discussion may be reopened any time a new kind of
archive joins OLAC.  However, each vocabulary revision must currently
be released as a new version of the entire OLAC metadata set, an
unacceptable bureaucratic obstacle.

2. The artificial distinction between refinements and schemes.  It is
not clear when a putative refinement is important enough to be adopted
as an OLAC standard, versus a user-defined scheme.  Some of the
refinements we recognize at present aren't as germane to the overall
enterprise as others (e.g. operating system vs subject language), and
may not have enough support to be retained.  Conversely, the community
is sure to develop new, useful ontologies that we don't support at
present, and we would need to change the OLAC metadata standard in
order to accommodate them.  Promoting a user-defined scheme to an OLAC
standard would necessitate a change in the XML representation, generating
unnecessary work for all archives that support the scheme.

3. Duplication of technical support.  User-defined schemes are likely
to involve controlled vocabularies, with the same needs as OLAC
vocabularies with respect to validation, translation to
human-readable form in service providers, and dumb-down to Dublin Core
for OAI interoperability.  At present, the necessary infrastructure
must be created twice over, once for each of the two mechanisms.

4. Idiosyncracies of XML schema.  XML schema is used to define the
well-formedness of OLAC records, but it is unable to express
co-occurrence constraints between attribute values.  This means that
we cannot have more than one vocabulary for an element, forcing us to
build structure into element names and multiply the names
(e.g. Format.markup, Format.cpu, Format.os, ...).  It is unfortunate
that such a fundamental aspect of the OLAC XML format depends on a
shortcoming of a tool that we may not be using for very long.

In sum, the current model will be difficult to manage over the long
term.  Administratively, it encourages us to seek premature closure on
issues of content description that can never be closed.  Technically,
it forces us to release new versions of the metadata format with each
vocabulary revision, and forces us to create software infrastructure to
support a mishmash of four syntactic extensions of DC:

   <element.EXT1 refine="EXT2" code="EXT3" scheme="EXT4">


C. A NEW APPROACH

In response to the problems outlined above, we would like to propose a
new approach.  The basic idea is simple: express all refinements,
vocabularies and schemes using a uniform DC extension mechanism, and
treat them all as recommendations instead of centrally-validated
standards.  The extension mechanism requires two attributes, called
"extension" and "code", as shown below:

  <subject extension="OLAC-Language" code="x-sil-SKY"/>
  <subject extension="AS-Formosan" code="Seediq"/>

It would be syntactically valid to simply use an extension in metadata
without defining it. However, for extensions that will be used across
the community, there must also be a formal definition that enumerates
the corresponding controlled vocabulary in such a way that data
providers and service providers alike can harvest the vocabulary from
its definitive source. Thus another aspect of the new approach is an
XML schema for the formal definition of an XDC extension. In the
description section of the OAI Identify response, a data provider
would declare which formally defined extensions it employs in its metadata.

Extensions that enjoyed broad community support would be identified as
OLAC Recommendations (following the existing OLAC Process).  All OLAC
archives would be encouraged to adopt them, in the sense that OLAC
service providers would permit end-users to perform focussed searches
over these extensions.  In this way, archives that cooperate with the
rest of the community are rewarded.

Note that the approach isn't specific to language archives, so we're
calling it extensible Dublin Core (XDC).  An example of the syntax
is available (an XML DTD, the equivalent XML schema, and an instance
document): http://www.language-archives.org/XDC/0.1/


D. BENEFITS

The new approach is technically simpler than the existing approach,
and neatly solves the four problems we reported.

1. Finalizing standard refinements.  The editors of OLAC vocabulary
documents would be empowered to edit the vocabulary into the future,
without concern for integration with new releases of the OLAC metadata
format.

2. The artificial distinction between refinements and schemes.  The
syntactic distinction is gone, being replaced by a semantic one: is
the vocabulary an OLAC Recommendation or not?  Any archive or group of
archives would be free to start using their own extensions without any
formal registration.  They could build a service to demonstrate the
merit of their extension, thereby encouraging other archives to adopt
it.  Once broad support had been established, they could build a case
for an OLAC Recommendation, leading to adoption across the community.

3. Duplication of technical support.  With the single extension
mechanism, we can provide uniform technical support for validation,
translation and dumb-down.

4. Idiosyncracies of XML schema.  We no longer give XML schema such
sway in determining our XML syntax.  Other XML and database technologies
will be used to test that an extension is used correctly.

In sum, the new approach is extensible, requiring no central
administration of extensions, and no coordination of vocabulary
revisions with new releases of the metadata format.  The new approach
also supports interoperability across the whole OLAC community (via
OLAC Recommendations) and also among OLAC sub-communities that want to
create their own special-purpose extensions.


E. IMPLICATIONS

We are still working out the technical implications for OLAC central
services (e.g. registration, Vida, ORE, etc), and we will only be able
to implement parts of this in time for the December meeting.  As
always, we would welcome donations of programmer time to help us.

The short-term implication for OLAC archives is completely trivial,
since only a simple syntactic change is required.

The most important implication of this change is that it reduces the
pressure to reach final agreement on OLAC vocabularies by our December
workshop.  But this isn't an excuse for us to slow down on that front.
On the contrary, it frees us up to find working solutions for the key
vocabularies that define us as a community.  These will always be
imperfect compromises that we can agree to work with and revise as
necessary, well into the future.

In sum, we hope we are not opening up a technical can of worms, but
facilitating progress on the substantive issues, our common
descriptive ontologies.  Therefore, we encourage people to identify a
particular extension that they would like to work on, and post their
ideas and questions to this list (as Baden Hughes has just now done
for sourcecode).  You may also like to present your ideas at our
workshop in December...

--

So, what do you think?  Do you agree with our proposals for
(i) a syntactic simplification in our XML representation, and
(ii) switching OLAC vocabularies from being centrally validated
standards to recommendations?  We would welcome your feedback.

Steven Bird & Gary Simons


From hdry at LINGUISTLIST.ORG  Thu Sep 26 23:36:44 2002
From: hdry at LINGUISTLIST.ORG (Helen Dry)
Date: Thu, 26 Sep 2002 19:36:44 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <OLAC-IMPLEMENTERS%2002092410070062@LISTSERV.LINGUISTLIST.ORG>
Message-ID: <THU.26.SEP.2002.193644.0400.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Hi, Gary (and everyone),

I've just sent a long posting to the list explaining some of my problems with Steven's
& Gary's proposal, so all I want to do here is respond briefly. I completely agree with
your point about the value of syntactic simplification.  But I'm not sure about the
second point--reducing all OLAC vocabularies to recommendations.  It's interesting
where our opinions diverge--i.e., you see the benefits to the archive, which may
already have a user-defined scheme, and I see the possible problems for the
general service provider, which may not be able to handle multiple user-defined
schemes in an efficient way.  Perhaps OLAC can handle this problem by making
STRONG recommendations . . . but in that case, I don't see the real difference
between recommendations and a centrally validated standard . . . except for the fact
that OLAC wouldn't have to re-publish all the metadata whenever a
recommendation changed.  I suppose this would be an administrative advantage--
but enough of a one to lose the potential benefits of standardization???  I'm waiting
to be convinced....

-Helen


On 24 Sep 2002 at 10:07, Gary Holton wrote:

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
wrote:
>--
>
>So, what do you think?  Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations?  We would welcome your feedback.
>


Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is
very positive. Regarding (i), it is clear that a syntactic simplification
is needed. I for one have never been able to keep straight refinements vs.
schemes, and I don't think I'm alone here. And as you point out (ii), the
real issue should be not whether a particular refinement (and associated
vocabulary) has been officially adopted (mandated?), but rather whether a
such a refinement is useful to the community. We can debate ontologies, but
it is more difficult to debate usefulness without actually implementing a
refinement. Your proposal would permit refinements ("extensions") to fit
the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types
of refinements and vocabularies would be useful to the end user. The
difficulty is that we don't know enough about how the user will be
searching, what they will be searching for, and what types of search
facilities they will have. The best we can do at this point is make an
educated guess and then watch closely to see how the refinements and
vocabularies are actually used. That said, I think we have some very good
guesses already and will certainly be able to recommend best practices by
December. However, if we lock in the vocabularies then most archives will
continue to have to support both an OLAC schema and a user-defined schema
(as you point out). This would essentially remove the data provider from
the loop, in that user-defined schemas would be viewed as idiosyncratic and
non-standard. Allowing user-defined "extensions" would encourage innovation
on the part of both data and service providers--innovation mediated by the
end user.

Any reactions from others?

Gary Holton


From Gary_Simons at SIL.ORG  Fri Sep 27 00:24:30 2002
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Thu, 26 Sep 2002 19:24:30 -0500
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <THU.26.SEP.2002.192430.0500.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

Helen,

You hit the nail on the head when you observe: "in that case, I don't see
the real difference between recommendations and a centrally validated
standard".  It was that same observation, but coming from the point of view
of our status quo, that has been a key part of the motivation as Steven and
I have been thinking about what our version 1.0 standard should look like.

In version 0.4 we have a centrally validated and mandated standard, but it
has built-in optionality.  For instance, it is our standard to use SIL and
Linguist codes to identify languages precisely, but data providers also
have the option of just providing free text.  Thus the standard is
currently not requiring language codes but only recommending them as best
practice, and an examination of the harvested records from our 20 or so
participating data providers reveals the fact that many sites are not now
using codes.

Our proposal to take the controlled vocabularies out of the standard and to
treat them as best practice recommendations thus does not really change the
current reality.  In fact, it probably gives a better reflection of the
reality. One key advantage from the point of view of managing the
infrastructure is that it will not be necessary to change the standard when
controlled vocabularies are changed or added.  The metadata standard would
just specify the structure of the container record and the mechanism for
defining metadata extensions and would be very static.  Each controlled
vocabulary would be managed separately in an independent document and in a
formal extension definition that would supply downloadable code sets so
that extension data can still be centrally validated.  When the community
reaches a consensus that a particular vocabulary should be used when
applicable, then it would become a community Recommendation and our default
harvester would support it. Service providers would exploit it (such as
Linguist is now doing with searching by language) and that would show data
providers who are not yet using the vocabulary the benefits of using it.
We could even have a "Recommended practice report card" that would show
which recommended extensions an archive is using and which it is not.

Thus Steven and I are assuming that the end result of this change would not
weaken compliance to standardized vocabularies (which is already optional),
but that it would make it much easier to manage changes to vocabularies and
to experiment with specialized vocabularies.

I hope that helps to clarify where we are coming from.

-Gary Simons


                      Helen Dry <hdry at LINGUISTLIST.ORG>
                      Sent by: OLAC Implementers List            To:
                      <OLAC-IMPLEMENTERS at LISTSERV.LINGUI         OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST
                      STLIST.ORG>                                .ORG
                                                                 cc:
                                                                 Subject: Re: A simpler format for OLAC
                      09/26/02 06:36 PM                          vocabularies and schemes
                      Please respond to Open Language
                      Archives Community Implementers
                      List


Hi, Gary (and everyone),

I've just sent a long posting to the list explaining some of my problems
with Steven's
& Gary's proposal, so all I want to do here is respond briefly. I
completely agree with
your point about the value of syntactic simplification.  But I'm not sure
about the
second point--reducing all OLAC vocabularies to recommendations.  It's
interesting
where our opinions diverge--i.e., you see the benefits to the archive,
which may
already have a user-defined scheme, and I see the possible problems for the
general service provider, which may not be able to handle multiple
user-defined
schemes in an efficient way.  Perhaps OLAC can handle this problem by
making
STRONG recommendations . . . but in that case, I don't see the real
difference
between recommendations and a centrally validated standard . . . except for
the fact
that OLAC wouldn't have to re-publish all the metadata whenever a
recommendation changed.  I suppose this would be an administrative
advantage--
but enough of a one to lose the potential benefits of standardization???
I'm waiting
to be convinced....

-Helen


On 24 Sep 2002 at 10:07, Gary Holton wrote:

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
wrote:
>--
>
>So, what do you think?  Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations?  We would welcome your feedback.
>


Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is
very positive. Regarding (i), it is clear that a syntactic simplification
is needed. I for one have never been able to keep straight refinements vs.
schemes, and I don't think I'm alone here. And as you point out (ii), the
real issue should be not whether a particular refinement (and associated
vocabulary) has been officially adopted (mandated?), but rather whether a
such a refinement is useful to the community. We can debate ontologies, but
it is more difficult to debate usefulness without actually implementing a
refinement. Your proposal would permit refinements ("extensions") to fit
the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types
of refinements and vocabularies would be useful to the end user. The
difficulty is that we don't know enough about how the user will be
searching, what they will be searching for, and what types of search
facilities they will have. The best we can do at this point is make an
educated guess and then watch closely to see how the refinements and
vocabularies are actually used. That said, I think we have some very good
guesses already and will certainly be able to recommend best practices by
December. However, if we lock in the vocabularies then most archives will
continue to have to support both an OLAC schema and a user-defined schema
(as you point out). This would essentially remove the data provider from
the loop, in that user-defined schemas would be viewed as idiosyncratic and
non-standard. Allowing user-defined "extensions" would encourage innovation
on the part of both data and service providers--innovation mediated by the
end user.

Any reactions from others?

Gary Holton


From hdry at LINGUISTLIST.ORG  Fri Sep 27 16:46:45 2002
From: hdry at LINGUISTLIST.ORG (Helen Aristar Dry)
Date: Fri, 27 Sep 2002 12:46:45 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <OF9D7B4ADE.D5EB9713-ON86256C40.0083CA5B@sil.org>
Message-ID: <FRI.27.SEP.2002.124645.0400.>

Hi, Gary,

Yes, I take your point that we can't force compliance; and, in
general, I'd be all for letting standards evolve from usage.  But
actually, from the point of view of the LINGUIST service provider,
the languages example isn't a heartening one.  What our
programmer had to do to  search harvested OLAC metadata by
subject language is write a special program that translates any text
entry in the subject language field into the SIL code.   This is
possible to do with languages  only because we have the
Ethnologue name and alternate name tables on the site, and
therefore we have a list of almost all the language names that any
site might be using.  It's still a lot of work, and we're no doubt
missing or misclassifying the subject languages of a lot of records.
Nevertheless, we do have a search engine that is using Ethnologue
codes to identify resources by subject.language, thereby
demonstrating the utility of this recommendation.

But what are we going to do for linguistic data type and all the
other erstwhile controlled vocabularies?? There's no "alternate
name" reference for extensions (at least not as far as I know), such
that we could use it to write a translation program . .  even if it were
feasible to translate every relevant value in every metadata record.
And it makes no sense to set up search facilities that use the
recommended vocabulary if  there's no data classified by it--getting
a lot of "not found" messages will discourage users from using the
recommended vocabulary, not encourage it.  So our search engine
is not going to be any help in promulgating these recommendations.

Sigh.  I realize that mandating a controlled vocabulary wouldn't
ensure that archives used it.  Perhaps it would give them a little
more impetus, however.  And it would certainly be nice if each
archive would "translate" its user-defined metadata into the
recommended OLAC vocabulary, rather than leaving the service
provider to figure out how to do it  for multiple archives, each with
its own idiosyncratic and undocumented set of extensions.
I'm still hoping that you and Steven will come up with some bright
ideas about how to help/encourage/convince archives to do this . . .

Sorry to be negative.  You know I think OLAC is the best thing
since sliced bread. . . . I'm just having some trouble figuring out
how we're going to cope with the new-fangled slices....

All the best,
-Helen


Date sent:      	Thu, 26 Sep 2002 19:24:30 -0500
Send reply to:  	Open Language Archives Community Implementers List
             	<OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG>
From:           	Gary Simons <Gary_Simons at SIL.ORG>
Subject:        	Re: A simpler format for OLAC vocabularies and schemes
To:             	OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG

> Helen,
>
> You hit the nail on the head when you observe: "in that case, I don't see
> the real difference between recommendations and a centrally validated
> standard".  It was that same observation, but coming from the point of view
> of our status quo, that has been a key part of the motivation as Steven and
> I have been thinking about what our version 1.0 standard should look like.
>
> In version 0.4 we have a centrally validated and mandated standard, but it
> has built-in optionality.  For instance, it is our standard to use SIL and
> Linguist codes to identify languages precisely, but data providers also
> have the option of just providing free text.  Thus the standard is
> currently not requiring language codes but only recommending them as best
> practice, and an examination of the harvested records from our 20 or so
> participating data providers reveals the fact that many sites are not now
> using codes.
>
> Our proposal to take the controlled vocabularies out of the standard and to
> treat them as best practice recommendations thus does not really change the
> current reality.  In fact, it probably gives a better reflection of the
> reality. One key advantage from the point of view of managing the
> infrastructure is that it will not be necessary to change the standard when
> controlled vocabularies are changed or added.  The metadata standard would
> just specify the structure of the container record and the mechanism for
> defining metadata extensions and would be very static.  Each controlled
> vocabulary would be managed separately in an independent document and in a
> formal extension definition that would supply downloadable code sets so
> that extension data can still be centrally validated.  When the community
> reaches a consensus that a particular vocabulary should be used when
> applicable, then it would become a community Recommendation and our default
> harvester would support it. Service providers would exploit it (such as
> Linguist is now doing with searching by language) and that would show data
> providers who are not yet using the vocabulary the benefits of using it.
> We could even have a "Recommended practice report card" that would show
> which recommended extensions an archive is using and which it is not.
>
> Thus Steven and I are assuming that the end result of this change would not
> weaken compliance to standardized vocabularies (which is already optional),
> but that it would make it much easier to manage changes to vocabularies and
> to experiment with specialized vocabularies.
>
> I hope that helps to clarify where we are coming from.
>
> -Gary Simons
>
>
>
>
>
>                       Helen Dry <hdry at LINGUISTLIST.ORG>
>                       Sent by: OLAC Implementers List            To:
>                       <OLAC-IMPLEMENTERS at LISTSERV.LINGUI         OLAC-IMPLEMENTERS at LISTSERV.LINGU
ISTLIST
>                       STLIST.ORG>                                .ORG
>                                                                  cc:
>                                                                  Subject: Re: A simpler format fo
r OLAC
>                       09/26/02 06:36 PM                          vocabularies and schemes
>                       Please respond to Open Language
>                       Archives Community Implementers
>                       List
>
>
>
>
>
> Hi, Gary (and everyone),
>
> I've just sent a long posting to the list explaining some of my problems
> with Steven's
> & Gary's proposal, so all I want to do here is respond briefly. I
> completely agree with
> your point about the value of syntactic simplification.  But I'm not sure
> about the
> second point--reducing all OLAC vocabularies to recommendations.  It's
> interesting
> where our opinions diverge--i.e., you see the benefits to the archive,
> which may
> already have a user-defined scheme, and I see the possible problems for the
> general service provider, which may not be able to handle multiple
> user-defined
> schemes in an efficient way.  Perhaps OLAC can handle this problem by
> making
> STRONG recommendations . . . but in that case, I don't see the real
> difference
> between recommendations and a centrally validated standard . . . except for
> the fact
> that OLAC wouldn't have to re-publish all the metadata whenever a
> recommendation changed.  I suppose this would be an administrative
> advantage--
> but enough of a one to lose the potential benefits of standardization???
> I'm waiting
> to be convinced....
>
> -Helen
>
>
>
> On 24 Sep 2002 at 10:07, Gary Holton wrote:
>
> On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird <sb at UNAGI.CIS.UPENN.EDU>
> wrote:
> >--
> >
> >So, what do you think?  Do you agree with our proposals for
> >(i) a syntactic simplification in our XML representation, and
> >(ii) switching OLAC vocabularies from being centrally validated
> >standards to recommendations?  We would welcome your feedback.
> >
>
>
> Dear Steven & Gary,
>
> I haven't had much time to digest your proposal, but my initial reaction is
> very positive. Regarding (i), it is clear that a syntactic simplification
> is needed. I for one have never been able to keep straight refinements vs.
> schemes, and I don't think I'm alone here. And as you point out (ii), the
> real issue should be not whether a particular refinement (and associated
> vocabulary) has been officially adopted (mandated?), but rather whether a
> such a refinement is useful to the community. We can debate ontologies, but
> it is more difficult to debate usefulness without actually implementing a
> refinement. Your proposal would permit refinements ("extensions") to fit
> the needs of the community, so that useful solutions could evolve.
>
> I have often approached the metadata issue by trying to imagine what types
> of refinements and vocabularies would be useful to the end user. The
> difficulty is that we don't know enough about how the user will be
> searching, what they will be searching for, and what types of search
> facilities they will have. The best we can do at this point is make an
> educated guess and then watch closely to see how the refinements and
> vocabularies are actually used. That said, I think we have some very good
> guesses already and will certainly be able to recommend best practices by
> December. However, if we lock in the vocabularies then most archives will
> continue to have to support both an OLAC schema and a user-defined schema
> (as you point out). This would essentially remove the data provider from
> the loop, in that user-defined schemas would be viewed as idiosyncratic and
> non-standard. Allowing user-defined "extensions" would encourage innovation
> on the part of both data and service providers--innovation mediated by the
> end user.
>
> Any reactions from others?
>
> Gary Holton