From sb at CS.MU.OZ.AU  Tue Oct  1 07:33:05 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Tue, 1 Oct 2002 03:33:05 EDT
Subject: Call for Participation: OLAC Workshop
In-Reply-To: Your mail dated Thursday 8 August, 2002.
Message-ID: <TUE.1.OCT.2002.033305.EDT.GARYSIMONS@SIL.ORG>

Folks - the workshop is fast approaching; just over two months to go now.
If you haven't already done so, please communicate your intention to
participate to Gary and me, by replying to this email.

We'll be circulating more details about the workshop soon.  For now please
take a look at the list of preparatory tasks from the original call, which
I'm appending below.

Thanks,
Steven Bird

>
> 		    WORKSHOP ON OPEN LANGUAGE ARCHIVES
> 	    Institute for Research in Cognitive Science (IRCS)
> 		 University of Pennsylvania, Philadelphia
> 			   December 10-12, 2002
>
> 	   Sponsored by the National Science Foundation project:
> 	  International Standards in Language Engineering (ISLE)
>
>
> OLAC, the Open Language Archives Community, was founded at the
> Workshop on Web-Based Language Documentation and Description, in
> December 2000.  During 2001, the OLAC development phase, the core
> infrastructure for OLAC was built and alpha testers implemented data
> providers.  During 2002, the pilot phase, we froze the standards to
> encourage wider adoption and experience with the metadata and the
> protocol.  At the close of 2002 we want to draw together all this
> experience, make final revisions, and launch the operational phase.
> With this launch, the OLAC standards will be promoted from "candidate"
> to "adopted", and version 1.0 of the OLAC XML schemas will be released.
>
>
> WORKSHOP GOALS
>
> The workshop will be tightly focussed on the following goals:
>
> 1. Standards: To revise the three proposed standards, the OLAC
>    Metadata Set, the OLAC Process document and the OLAC Protocol.
>
> 2. Vocabularies: To finalize the controlled vocabularies: linguistic
>    type, software functionality, rights, format, encoding, ...
>
> 3. Review: To give feedback to each participating archive on its use
>    of metadata, to review the services on the OLAC and LINGUIST sites.
>
> 4. Proposals: To hear new proposals for working groups, encoding
>    schemes, implementation notes and best practice recommendations,
>    and position papers on work that still needs to be done.
>
> In support of these goals, the workshop will consist of:
> * group discussions, both plenary and in parallel working groups;
> * review/editing of documents, both in working groups and in private;
> * plus a limited number of presentations (cf goal 4).
>
> NB. No time will be allocated for project reports in the formal program.
>
>
> PARTICIPATION
>
> The workshop is open to advisory board members and representatives of
> participating archives, consistent with our core value of "Empowering
> the Players" [http://www.language-archives.org/OLAC/process.html].
>
> *** Please communicate your intention to participate by October 1.
>
> NB. If you have been thinking about becoming an OLAC data provider, now
> would be a good time to act. Any archive that becomes a data provider
> by October 1 will also be invited to participate in this foundation
> setting workshop.  For more information on becoming a data provider,
> please see http://www.language-archives.org/docs/implement.html
>
>
> SPONSORSHIP
>
> The workshop is being sponsored by the NSF ISLE project "International
> Standards in Language Engineering".  We have funding for accommodation
> at the University Sheraton, a short walk from IRCS.  No registration
> fee will be charged.  Some travel support may also be available.
>
>
> PREPARATORY TASKS
>
> In order to ensure that the workshop achieves its goals, participants
> will be expected to help create, review and edit draft documents ahead
> of the meeting.  We would like each person to contribute 1-2 days
> each month to this effort from September onwards.  The preparatory tasks
> correspond to our workshop goals, and are as follows:
>
> 1. Standards: review all the standards documents and suggest revisions
>
> 2. Vocabularies: review some of the controlled vocabularies and
>    suggest revisions
>
> 3. Review: choose three participating archives besides your own and
>    suggest improvements to their use of metadata; review the
>    www.language-archives.org site and the www.linguistlist.org/olac/
>    service and suggest improvements.
>
> 4. Proposals: draft an encoding scheme, an implementation note, a
>    best practice recommendation, or a proposal for anything else that
>    needs to be done, and present it to the group.
>
> The success of the workshop will depend on active participation in
> these tasks.  Comments circulated in advance will have the most impact
> on our work.  To facilitate the process we will use this list,
> OLAC-Implementers, except where formal working groups have already
> been established with their own lists.  Note that OLAC-Implementers is
> an open, unmoderated list, archived on the LINGUIST site at:
> http://lists.linguistlist.org/archives/olac-implementers.html
>
> More information will be circulated in September.  In the meantime,
> please feel free to get started on any of the above tasks...
>
> Steven Bird & Gary Simons
>
>


From sb at CS.MU.OZ.AU  Thu Oct  3 01:38:48 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Wed, 2 Oct 2002 21:38:48 EDT
Subject: Some comments on the LINGUIST service provider
Message-ID: <WED.2.OCT.2002.213848.EDT.SB@CS.MU.OZ.AU>

One of the workshop preparatory tasks is:

> 3. Review: choose three participating archives besides your own and
>    suggest improvements to their use of metadata; review the
>    www.language-archives.org site and the www.linguistlist.org/olac/
>    service and suggest improvements.

I have three low-level comments on the LINGUIST service provider.  I hope
this feedback will make the service even better than it already is...

a) The first page you come to is a long document with a search form some
way down.

I'd favor a very simple page (cf www.google.com) consisting of a search
box, a link to the advanced search, and a link to "more about OLAC" which
has all the original text.

b) Users wanting "more powerful search" are directed to the "OLAC Query
page".  (Weren't we just on an OLAC query page?)  Arriving on this new page,
we see that it is called "OLAC Query Form: Simple Search".  This is
confusing, since we've just come from a simple search page expecting the
more powerful search page, only to find that this is still only simple
search.  There's no pointer back to the really simple search.

I'd prefer this to be called "Advanced Search" (both on the title and the
incoming link), with a backpointer to the simple search.

c) This second page points to yet another page, called Advanced Search.
However, this generates an error: "ODBC Error Code = S1000 (General error)
[TCX][MyODBC]Table 'OLAC.alltypes' doesn't exist".  I expect this really
advanced search permits search on all fields.

I'm not convinced we need three levels of search.  Could the second and
third levels be collapsed into a single level, containing all the search
fields?

Does anyone else have comments on this service?

-Steven

--
Steven Bird        Email: <sb at cs.mu.oz.au>  Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania


From baden at COMPULING.NET  Thu Oct  3 10:24:37 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 20:24:37 +1000
Subject: Some comments on the LINGUIST service provider
In-Reply-To: <200210030138.g931cmM07394@unagi.cis.upenn.edu>
Message-ID: <THU.3.OCT.2002.202437.1000.>

From dealing with some new end users who have been introduced to OLAC
via the Linguist interface, I've got a couple of related comments.

Users would like to have a simple search - by title, author, description
and subject language. This would mean author would be added to the
existing Quick Search.

There is a difference between the number of archives actively searched
on the LL site and those registered at the OLAC site. I would have
assumed automated harvesting of the new archives as they are registered
at either location?


An ultra-low level comment, when you click on the link at the bottom of
the LinguistList OLAC page:

"If you would like to help with the OLAC enterprise, please let us know!

Thank you in advance for your help!  "

An email message is launched, but there's no email address to send
things to (ie mailto: is malformed).

Baden


From baden at COMPULING.NET  Thu Oct  3 10:28:59 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 20:28:59 +1000
Subject: OLAC resources
Message-ID: <THU.3.OCT.2002.202859.1000.>

FWIW, the format.cpu, format.os and format.sourcecode schemas are
available at http://www.compuling.net/projects/olac/ along with some
other OLAC resources under development.


Baden


From baden at COMPULING.NET  Thu Oct  3 12:14:29 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 22:14:29 +1000
Subject: experimental schema: format.sourcestatus
In-Reply-To: <200209162213.g8GMDGL02117@unagi.cis.upenn.edu>
Message-ID: <THU.3.OCT.2002.221429.1000.>

Earlier I wrote to this list describing a problem I had found with the
schemas format.* in that they did not necessarily describe a certain
aspect of a software resource.

I believe retaining the format.cpu, format.os and format.sourcecode
vocabularies is beneficial. However, I would like to propose a new
addition to these, namely a schema for "format.sourcestatus", which
would be an optional controlled vocabulary, considered experimental only
at this stage.

The purpose of format.sourcestatus is to address two needs identified by
end users as critical to being able to evaluate a software resource and determine
its degree of utility to their own circumstances, eloquently expressed
by Steven Bird as:

> the end-user requirement here is to be able to answer the
> question: "Can I run this software?"

and

> the end-user requirement here is to be able to answer the
> question: "How much effort will be required to get this running?"

In addressing these questions, format.sourcestatus is a controlled
vocabulary that provides a range of descriptive options which assist the
user in identifying whether or not they can use the software resource in
question, and what additional requirements there will be to make it
work.

format.sourcestatus will contain enumeration values like the following:

	Pre-Compiled Binary
	Requires Compilation
	Requires Make
	Wrapped Installation
	Script

There is a rudimentary draft of this available at:

http://www.compuling.net/projects/olac/031002-draft-olac-format.sourcestatus.xsd

It also occurs to me that format.sourcecode may not be the best name for
the controlled vocabulary. In essence, the identification performed by
this schema is of the language in which sourcecode is written.

Any comments?

Baden


From sb at CS.MU.OZ.AU  Thu Oct  3 22:38:05 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 3 Oct 2002 18:38:05 EDT
Subject: Some comments on the LINGUIST service provider
In-Reply-To: Your mail dated Sunday 3 November, 2002.
Message-ID: <THU.3.OCT.2002.183805.EDT.SB@CS.MU.OZ.AU>

Helen Aristar Dry wrote:
> But he suggests having a search blank, plus a full search.  I guess I
> just need to think about whether there's some way to do both what he
> suggests and what you suggest.

Would this work: a simple search page with a single keyword search field,
and an advanced search page in which the most salient fields (e.g. Baden's
list) appeared at the top?  Further fields could be separated off from the
main ones and/or be given in smaller type.

Steven Bird


From hdry at LINGUISTLIST.ORG  Thu Oct  3 23:12:17 2002
From: hdry at LINGUISTLIST.ORG (Helen Aristar Dry)
Date: Thu, 3 Oct 2002 19:12:17 -0400
Subject: Some comments on the LINGUIST service provider
In-Reply-To: <200210032238.g93Mc6M09019@unagi.cis.upenn.edu>
Message-ID: <THU.3.OCT.2002.191217.0400.>

Good idea, Steven.  Thanks.  -Helen



From baden at COMPULING.NET  Fri Oct  4 13:55:00 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 4 Oct 2002 23:55:00 +1000
Subject: experimental schema:type.functionality
Message-ID: <FRI.4.OCT.2002.235500.1000.>

The purpose of type.functionality is to describe the functionality of a
software resource.

There is a rudimentary draft of this available at:

http://www.compuling.net/projects/olac/041002-draft-olac-type.functionality.xsd

This is based on the categorization from the HLT Survey at
http://cslu.cse.ogi.edu/HLTsurvey/

Baden


From sb at CS.MU.OZ.AU  Mon Oct  7 02:34:22 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Sun, 6 Oct 2002 22:34:22 EDT
Subject: experimental schema: format.sourcestatus
In-Reply-To: Your mail dated Thursday 3 October, 2002.
Message-ID: <SUN.6.OCT.2002.223422.EDT.SB@CS.MU.OZ.AU>

Last week Baden Hughes presented a new encoding scheme called source
status.  Here are some initial comments:

> Pre-Compiled Binary

or just "binary"?

> Requires Compilation
> Requires Make
> Wrapped Installation

These three are closely related - a build is required, and the
difference is in how much work the person has to do.

> Script

So a simple starting point here would be to have a three-way
distinction between binary, interpreted and compiled.
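
As a rough illustration of that three-way distinction, the five draft
values could be folded into three classes.  This is only a sketch: the
class names and the assignment of each draft value to a class are
assumptions read off the comments above, not part of any OLAC schema.

```python
# Hypothetical folding of the draft format.sourcestatus values into the
# three-way distinction suggested above.  The mapping itself is an
# assumption made for illustration.
DISTRIBUTION_CLASS = {
    "Pre-Compiled Binary": "binary",
    "Requires Compilation": "compiled",   # a build is required
    "Requires Make": "compiled",          # a build is required
    "Wrapped Installation": "compiled",   # a build is required
    "Script": "interpreted",
}


def classify(source_status):
    """Return 'binary', 'compiled' or 'interpreted' for a draft value."""
    try:
        return DISTRIBUTION_CLASS[source_status]
    except KeyError:
        raise ValueError(f"unknown sourcestatus value: {source_status!r}")
```

A lookup table keeps the mapping easy to revise if the vocabulary changes.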

[Aside: In all three cases, other packages may need to be downloaded,
built and installed before the software can be run, and these will
need to be documented using the relation.requires element/refinement.
Presumably we won't bother specifying that a C compiler is required
for a resource that is specified as being in the C language, unless a
particular compiler/version is required.]

Notice that the distinction between interpreted and compiled is
largely predictable from the source language, and that the source code
might not actually be provided.  Therefore, we want to focus not on the
source code, but the nature of the distribution (format.distribution?).
Obviously, this now applies to data as well as software, since data can
come in binary or source forms, with or without wrapping.

The distribution methods include archives (tar, zip, rpm) which may be
compressed, and may be self-extracting or require other software.  The
self-extracting kind might actually manage the download and
registration process, as in the case of the CSLU toolkit.  To some
extent, the distribution method is predictable from the MIME type of
the file, which weakens the case for special treatment of distribution
types.

An orthogonal issue is size: can I download this over a modem line?

Anyway, to move things forward here, we may need to do some more study of
end-user needs.

-Steven



From sb at CS.MU.OZ.AU  Fri Oct 18 02:11:48 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 17 Oct 2002 22:11:48 EDT
Subject: Local arrangements in Philadelphia
Message-ID: <THU.17.OCT.2002.221148.EDT.SB@CS.MU.OZ.AU>

Folks,

I have now set up a website for the workshop at:
http://www.language-archives.org/events/olac02/

The most important information it contains now is the list of confirmed
participants and the arrangements for booking your hotel room.  Note that
we are paying for hotel rooms for the confirmed participants (except
local participants).

Please call the hotel to make your booking, using one of the numbers
on the website.  Please contact Laurel Sweeney at Penn if you encounter any
problems with the booking process.

Information about the workshop program will be posted next week.

Others who wish to attend need to contact me as soon as possible please.

Thanks,
-Steven



From sb at CS.MU.OZ.AU  Wed Oct 23 10:26:21 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Wed, 23 Oct 2002 06:26:21 EDT
Subject: workshop program
Message-ID: <WED.23.OCT.2002.062621.EDT.SB@CS.MU.OZ.AU>

Folks,

I'm sorry that the workshop program is long overdue.  There is a lot to
cover, and Gary and I would like to solicit your input on priorities, and
on the contributions of each participant.

We think the top level goals are:
1. to effect the transition to the operational phase of OLAC
2. to set the agenda for the coming year
3. to foster ongoing collaboration amongst the participants
   in the pursuit of the above

In support of these goals, the primary workshop activities need to be:
1. presenting and reviewing all the standards, understanding the
   implementation issues, and releasing version 1.0
2. finalizing, testing and documenting key recommendations - the metadata vocabularies
3. evaluating the community infrastructure - website, services, documentation

Here then is a comprehensive overview of the OLAC infrastructure, both
existing and planned, along with various suggestions about what we
could accomplish before/during the workshop, and who could possibly take
the lead in doing or delegating the work.  There is a lot here, but many
items can be dispensed with quickly (e.g. a 10 minute report), while some
big things that are beyond the scope of our workshop can be put on the
agenda of a working group for 2003.  I hope that the work will be shared
around, so that everyone has significant activities to do in the remaining
six weeks.

So please suggest priorities, identify any omissions, and volunteer to work
on something.  I'll convert this into a provisional program by the start of
next week.

Thanks,
-Steven

----

Annotations:
feedback: feedback requested before workshop
overview: a short presentation (10 minutes)
presentation: full presentation (20-30 minutes)
wg: working group(s) will process this


1. STANDARDS (Tuesday)

All of these need to be presented on day 1 (even if briefly) to make
sure there is enough time for feedback and consensus building if any
issues do arise.

a) OLAC-Process [feedback, overview] - Gary Simons?
   * present and discuss at start of workshop because it
     defines how we will operate even during the workshop

b) OLAC-PMH [overview, wg] - Steven Bird?
   * the primary issue will be the transition from OAI 1.1 to 2.0
   * those who implement data providers to discuss

c) OLAC Metadata Format [feedback, presentation, wg] - Steven Bird?
   * new work on representing OLAC metadata in XML
   * more information will be circulated this week
   * those who implement data providers to discuss

d) OLAC Metadata Extension Mechanism [presentation, wg] - Steven Bird?
   * how to express a vocabulary in a harvestable schema fragment
   * those who implement 3rd party extensions to discuss


2. RECOMMENDATIONS (Tuesday/Wednesday)

These are our vocabularies, along with any new proposals for recommendations
(e.g. best practices for digitizing audio recordings).

a) OLAC-Language [overview] - Gary Simons?, Anthony Aristar?

b) OLAC-Linguistic-Type [feedback, overview, wg?] - Heidi Johnson?, Helen Aristar Dry?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository
   * the working group meeting may not be necessary

c) OLAC-Linguistic-Fields [feedback, overview] - Helen Aristar Dry?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository

d) OLAC-Role: [feedback, overview, wg] - Heidi Johnson?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository
   * still need to consider roles in the creation of language technologies
     and corpus publications

e) OLAC-Rights: [feedback, overview, wg] - Heidi Johnson?, Steven Bird?

Other vocabularies to consider: OLAC-Encoding, OLAC-Format, OLAC-Functionality.
Time to be given to testing the vocabularies on existing repositories.


3. ARCHIVES AND SERVICES (Wednesday)

a) review metadata quality for existing archives [feedback]

b) OLAC website [feedback]

c) Registration [overview] - Gary Simons?

d) Vida/ORE/ORyX/OLACA/Viser [overview]
   * need to identify developers to help in 2003

e) LINGUIST [feedback, overview] - Helen Aristar Dry?, Anthony Aristar?


4. SUB-COMMUNITY EXTENSIONS (Wednesday)

a) Language technology [feedback, overview, wg] - Baden Hughes?
   * vocabulary documents to be circulated before the workshop
   * work on vocabularies for OS, CPU, Sourcecode, Distribution

b) Language documentation [overview, wg] - Heidi Johnson?
   * IMDI/OLAC mapping?
   * possible common vocabularies across IMDI and OLAC


5. IMPLEMENTATION NOTES (Wednesday/Thursday)

Useful tools that people have developed:

- exporting MS Access to ORyX files for Net-DC - Andrew Cole?
- Net-DC experience - Khalid Choukri?
- AILLA database model - Erik Grostic?


6. AGENDA FOR 2003 (Thursday)

a) more best practices
   * there are many areas where we need best practice recommendations
     [http://www.ldc.upenn.edu/sb/home/publications.html#0204020]
   * who wants to pick a need and start working on a recommendation?

b) more data providers
   * outreach, special needs, help with data providers
   * many subcommunities are creating resources
   * who wants to commit to helping them hook up with OLAC?
   + linguistics - accessible OLAC introduction - Jeff Good?
   + language technology
   + national archives
   + text archives
   + museum archives (e.g. 19C fieldwork materials)
   + antiquity (e.g. classical and ancient Near East text collections)
   + others?

c) more service providers
   * regional services (e.g. Asia)
   * services tailored for research needs (e.g. typology)

d) proposals for other work that needs to be done

--end--


From sb at CS.MU.OZ.AU  Thu Oct 31 06:41:07 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 31 Oct 2002 01:41:07 EST
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <THU.31.OCT.2002.014107.EST.SB@CS.MU.OZ.AU>

About six weeks ago, Gary Simons and I presented a schematic outline
for a new representation for OLAC metadata.  We described a single
extension mechanism that would provide better interoperability and
extensibility, with less administrative and technical infrastructure
than before, with the goal of making it still easier for archives to
participate in OLAC.

About the same time we discovered very recent DCMI work on the XML
representation of DC and DC qualifiers:

  Guidelines for implementing Dublin Core in XML
  http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/

  Recommendations for XML Schema for Qualified Dublin Core
  http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/20021007/

These documents finally provide the DC XML framework that we had hoped
to find way back in January 2001, when we first started working on an
XML representation of our own Dublin Core qualifiers.

In the intervening six weeks we have figured out a new format for OLAC
metadata which implements our simplified extension mechanism, while
simultaneously re-using the new schemas from the DCMI.


REVIEW

To recap briefly, here are three examples showing OLAC 0.4 metadata,
the version in current use:

  <subject.language code="x-sil-BAN">Dschang</subject.language>
  <language scheme="AS-Formosan">Seediq</language>
  <contributor refine="editor">Sapir, Ned</contributor>

The examples illustrate several points:
(a) Element refinement: subject.language, editor (i.e. two different methods)
(b) OLAC encoding scheme: code="xxx"
(c) Free text element content, the escape hatch when OLAC codes don't fit
(d) A third party encoding scheme: scheme="xxx"

Here's the same information represented according to last month's
proposal for a simplified extension mechanism:

  <subject extension="OLAC-Language" code="x-sil-BAN">Dschang</subject>
  <language extension="AS-Formosan" code="Seediq"/>
  <contributor extension="OLAC-Role" code="editor">Sapir, Ned</contributor>

According to our proposal, this extension attribute would be used to
express all refinements, vocabularies and schemes, whether originating
from OLAC, an OLAC subcommunity, or an individual archive.  These
extensions wouldn't be centrally controlled, so individual archives
and groups of archives could develop their own extensions without any
community-wide approval process, and later demonstrate useful services
based on their extension in order to promote it to the community at
large.


REVISED REPRESENTATION

In the revised representation we are now proposing, the "extension"
attribute is renamed "xsi:type", and its value is given a namespace
prefix.  For example, the above three elements would be rewritten as
follows:

  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
  <language xsi:type="as:formosan" code="Seediq"/>
  <contributor xsi:type="olac:role" code="editor">Sapir, Ned</contributor>

This little change brings us into line with DCMI.  No longer do we
have to define DC and DC qualifiers ourselves; we can now simply
import the DCMI Schemas directly.  This means that OLAC metadata is
not simply a semantic extension of DC metadata as in the past, but the
OLAC metadata *format* is a *syntactic* extension of the DC metadata
format.
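
The renaming can be sketched in a few lines.  The rule used here (split
the old extension name at its first hyphen and lowercase both parts) is
inferred only from the three examples above and is an assumption; in
real records the prefix is bound by a namespace declaration, not by a
naming convention.

```python
def extension_to_xsi_type(extension):
    """Turn an old-style extension name into a prefixed xsi:type value.

    Assumed rule, inferred from the examples: 'OLAC-Language' becomes
    'olac:language' and 'AS-Formosan' becomes 'as:formosan'.
    """
    prefix, sep, local = extension.partition("-")
    if not sep or not prefix or not local:
        raise ValueError(f"cannot split extension name: {extension!r}")
    return f"{prefix.lower()}:{local.lower()}"
```

Only the first hyphen is treated as the separator, so a hyphen inside
the local name is kept.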


THE FILES

The schemas are posted at:
http://www.language-archives.org/OLAC/1.0b1/

The contents of the directory are as follows:

1. Example metadata record
* olac.xml

2. Top level OLAC schema
* olac.xsd

3. OLAC vocabularies (subject to approval at the December workshop)
* olac-date.xsd
* olac-language.xsd
* olac-linguistic-field.xsd
* olac-linguistic-type.xsd
* olac-role.xsd

4. Hypothetical third-party extensions (to be hosted off-site)

a) Academia Sinica Formosan language vocabulary
* third-party/as-formosan.xml
* third-party/as-formosan.xsd

b) LT-World Human Language Technology vocabulary
* third-party/ltworld-hlt-field.xml
* third-party/ltworld-hlt-field.xsd

c) Individual archive's own redefined OLAC vocabularies
* third-party/myolac.xml
* third-party/myolac.xsd

d) Networking Data Centers' vocabulary (LDC/ELRA)
* third-party/netdc.xml
* third-party/netdc.xsd

e) Software vocabularies
* third-party/software.xml
* third-party/software-cpu.xsd
* third-party/software-os.xsd
* third-party/software-sourcecode.xsd
* third-party/software.xsd

f) An example mixing three independent extensions
* third-party/combined.xml


TECHNICAL DISCUSSION

(a) About xsi:type

The xsi:type attribute is defined in the XML Schema standard. It is a
directive to a schema validator, telling it to override the definition
of the XML element with the named type definition. It uses the
namespace declaration to find the schema fragment that defines the
overriding type.  Thus, the attribute xsi:type="olac:language" says:
"take the DC definition of subject, add an optional 'code' attribute,
and restrict the code values to the range specified in the schema for
olac:language."

(b) Harvesting

When harvesting these records, OLAC service providers will store OLAC
and third-party metadata elements in the same way, using columns for
the extension name (i.e. the value of the xsi:type attribute), for the
code, and for the element content.  In this way, coded values and
element content will be searchable for both OLAC and third-party
vocabularies alike.  However, only OLAC vocabularies would have
special services associated with them (e.g. the language codes service
built into the LINGUIST service provider).  The proposer of a new
extension could set up their own service provider to demonstrate the
value of their vocabulary in resource discovery and promote it to the
whole OLAC community.
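
The column layout described above might look like this in a harvester.
This is a minimal sketch using Python's standard ElementTree parser; the
<record> wrapper, the function name and the tuple layout are assumptions
for illustration, not the actual LINGUIST implementation.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes as "{namespace-uri}name".
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical wrapper element; the three children are the examples
# from the REVISED REPRESENTATION section.
SAMPLE = """<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
  <language xsi:type="as:formosan" code="Seediq"/>
  <contributor xsi:type="olac:role" code="editor">Sapir, Ned</contributor>
</record>"""


def harvest_rows(xml_text):
    """Return one (element, extension, code, content) row per element."""
    rows = []
    for el in ET.fromstring(xml_text):
        rows.append((
            el.tag,                   # DC element name
            el.get(XSI_TYPE),         # extension name, or None
            el.get("code"),           # coded value, or None
            (el.text or "").strip(),  # free-text element content
        ))
    return rows
```

OLAC and third-party elements pass through the same code path, so both
end up searchable in the same columns.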

(c) Dumb-down

Dumb-down from a third-party extension to OLAC, and dumb-down from
OLAC to DC, are straightforward to implement in this model.  Full
details will be circulated in a later message.

(d) Application profiles

An "application profile" is a hybrid metadata record that combines
elements and attributes that come from multiple authorities [1,2].
Under the newly proposed approach, we can conceive of OLAC metadata as
an application profile for the language resources community.
When a third party wants to extend the OLAC application profile, they
are actually creating a new application profile that combines DC and OLAC
metadata elements and attributes, along with their own.

[1] http://www.ariadne.ac.uk/issue25/app-profiles/
[2] http://dublincore.org/documents/library-application-profile/

(e) Copying the DCMI use of XML schemas

The decision to copy the DCMI's use of XML Schemas has two unfortunate
and unavoidable consequences.  First, the XML representation of DC and
OLAC metadata is tied to XML Schema validation.  If the validation
technology is ever changed, then the metadata format will need to be
changed.  Second, the xsi:type declarations are not constrained as to
which DC element they appear on.  If a metadata record used the role
vocabulary on an inappropriate element such as title, then the schema
validation would not report this error.

These are problems with the implementation decisions made by the
DC-Architecture Working Group, problems that we inherit.  We feel that
it is more important to conform to the DCMI and work with them to
address these issues, rather than continuing to work in isolation.
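
Since schema validation will not catch a misplaced xsi:type, a service
provider could run a separate check.  The sketch below is one possible
shape for such a check; the table of expected pairings is an assumption
and lists only the combinations that appear in the examples in this
message, so it is deliberately incomplete.

```python
# Illustrative pairing of xsi:type values with the elements they are
# meant for.  Only pairings shown in this message are listed; a real
# table would need to come from the vocabulary documentation.
EXPECTED_ELEMENTS = {
    "olac:language": {"subject"},
    "olac:linguistic-field": {"subject"},
    "olac:role": {"contributor"},
}


def misplaced(elements):
    """Return the (element, xsi_type) pairs that look misplaced.

    `elements` is an iterable of (element_name, xsi_type) pairs.  Types
    missing from the table (e.g. third-party ones) are not judged.
    """
    return [
        (name, xsi_type)
        for name, xsi_type in elements
        if xsi_type in EXPECTED_ELEMENTS
        and name not in EXPECTED_ELEMENTS[xsi_type]
    ]
```

For example, the role vocabulary on a title element is reported, while
an unknown third-party type is passed over.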

(f) Preserving a simple migration path

The new proposal maintains the simple migration path that is currently
permitted with OLAC 0.4.  This is an important feature for new
archives coming in to OLAC.  The following sequence illustrates the
migration path:

Step 1: archive maps their topic descriptor to the DC subject element:
  <subject>prosody</subject>

Step 2: archive uses the OLAC extension as a refinement, to state that
  the element content pertains to a linguistic field:
  <subject xsi:type="olac:linguistic-field">prosody</subject>

Step 3a: archive identifies the nearest OLAC code but retains
  their own data as a comment, to provide additional information:
  <subject xsi:type="olac:linguistic-field" code="phonology">prosody</subject>

OR
Step 3b: archive persuades community to accept a new vocabulary item:
  <subject xsi:type="olac:linguistic-field" code="prosody"/>

Note that step 3a illustrates an escape hatch for archives that have a
problem mapping their descriptors to OLAC vocabulary items.

Note also that this approach represents a minor deviation from the
DCMI approach, which puts coded values in the element content, leaving
no room for comments.
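
The steps can be told apart mechanically from the attributes an element
carries.  The following sketch assumes exactly the four shapes listed
above; the function name and the step labels are made up for
illustration.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XSI_TYPE = "{%s}type" % XSI_NS


def migration_step(fragment):
    """Classify one metadata element as step '1', '2', '3a' or '3b'."""
    # Wrap the fragment so that the xsi prefix is declared for the parser.
    wrapped = '<r xmlns:xsi="%s">%s</r>' % (XSI_NS, fragment)
    el = ET.fromstring(wrapped)[0]
    has_type = el.get(XSI_TYPE) is not None
    has_code = el.get("code") is not None
    has_text = bool((el.text or "").strip())
    if not has_type:
        return "1"    # plain DC element
    if not has_code:
        return "2"    # refinement only, content is free text
    # Coded value: with a retained comment (3a) or without (3b).
    return "3a" if has_text else "3b"
```

A report built on such a function could show each archive how far along
the path its records are.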


CONCLUSION

The revised proposal differs minimally from the previous proposal: the
"extension" attribute is renamed "xsi:type".

We believe this proposal represents a significant improvement on the
current OLAC 0.4 format in the areas of simplicity, interoperability
and extensibility.  Furthermore, it puts us squarely in the DC
community: OLAC won't have to reimplement each new DC Qualifier that
the DCMI adopts; OLAC can benefit from any software that works on DC
metadata; and OLAC vocabularies can be easily adopted outside the OLAC
community.

With your approval, we will document this new format and bring it up
at the December workshop as the proposal for OLAC version 1.0.  Once
adopted, each OLAC archive would be required to support it in order to
participate in OLAC.

Please send any comments to the list.

Steven Bird & Gary Simons


From Gary_Simons at SIL.ORG  Thu Oct 31 13:31:44 2002
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Thu, 31 Oct 2002 07:31:44 -0600
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <THU.31.OCT.2002.073144.0600.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

On rereading our posting this morning, I realized that there is one major
feature of the new approach that we failed to mention since we were so
focused on explaining extensions to DC metadata. That issue is how the
refinements that are already defined by DCMI will work.

In OLAC 0.4, we used a "refine" attribute for the names of refinements
defined in the Qualified DC recommendation. We made this up in the absence
of any recommendation from DCMI as to how this should be implemented. If
you look up the new DCMI documents referenced in the main posting, you will
see that they have now addressed this issue, and their solution is to treat
the refinements as tags in their own right, but they are from the "dcterms"
namespace, rather than the "dc" namespace.

Thus, this from OLAC 0.4:

   <title>Original title</title>
   <title refine="alternative">Translated title</title>

would be the following in OLAC 1.0:

   <dc:title>Original title</dc:title>
   <dcterms:alternative>Translated title</dcterms:alternative>

N.B. Since our new solution is an application profile, the 15 main metadata
tags (like title in this example) are in the Dublin Core namespace rather
than our own.  In the examples that Steven has posted in the /OLAC/1.0b1/
directory, the Dublin Core namespace is declared to be the default
namespace, so that the above is actually expressed as:

   <title>Original title</title>
   <dcterms:alternative>Translated title</dcterms:alternative>

Anyway, I thought I should point out this difference between OLAC 0.4 and
the proposed 1.0 since it, too, will have an impact on your implementation
of data providers.
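
For implementers, the change can be sketched as a small conversion. This is an illustration under stated assumptions rather than an official tool: it assumes the 0.4 input elements carry no namespace, it uses a hypothetical "record" wrapper, and it only knows the handful of DCMI refinement names listed in the set below. Other "refine" values (such as the OLAC role "editor") are left untouched for separate handling.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# A small, deliberately incomplete set of refinements defined by DCMI.
DCMI_REFINEMENTS = {"alternative", "abstract", "created", "issued", "modified"}


def upgrade_refinements(record_xml: str) -> ET.Element:
    """Turn refine="..." attributes into dcterms elements; put the rest in dc."""
    old = ET.fromstring(record_xml)
    new = ET.Element(old.tag)
    for child in old:
        attrs = dict(child.attrib)
        refine = attrs.get("refine")
        if refine in DCMI_REFINEMENTS:
            # The refinement becomes a tag in its own right.
            del attrs["refine"]
            tag = f"{{{DCTERMS}}}{refine}"
        else:
            # Unrefined (or non-DCMI refined) elements move to the dc namespace.
            tag = f"{{{DC}}}{child.tag}"
        out = ET.SubElement(new, tag, attrs)
        out.text = child.text
    return new
```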

-Gary Simons


From sb at CS.MU.OZ.AU  Tue Oct  1 07:33:05 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Tue, 1 Oct 2002 03:33:05 EDT
Subject: Call for Participation: OLAC Workshop
In-Reply-To: Your mail dated Thursday 8 August, 2002.
Message-ID: <TUE.1.OCT.2002.033305.EDT.GARYSIMONS@SIL.ORG>

Folks - the workshop is fast approaching; just over two months to go now.
If you haven't already done so, please communicate your intention to
participate to Gary and me, by replying to this email.

We'll be circulating more details about the workshop soon.  For now please
take a look at the list of preparatory tasks from the original call, which
I'm appending below.

Thanks,
Steven Bird

>
> 		    WORKSHOP ON OPEN LANGUAGE ARCHIVES
> 	    Institute for Research in Cognitive Science (IRCS)
> 		 University of Pennsylvania, Philadelphia
> 			   December 10-12, 2002
>
> 	   Sponsored by the National Science Foundation project:
> 	  International Standards in Language Engineering (ISLE)
>
>
> OLAC, the Open Language Archives Community, was founded at the
> Workshop on Web-Based Language Documentation and Description, in
> December 2000.  During 2001, the OLAC development phase, the core
> infrastructure for OLAC was built and alpha testers implemented data
> providers.  During 2002, the pilot phase, we froze the standards to
> encourage wider adoption and experience with the metadata and the
> protocol.  At the close of 2002 we want to draw together all this
> experience, make final revisions, and launch the operational phase.
> With this launch, the OLAC standards will be promoted from "candidate"
> to "adopted", and version 1.0 of the OLAC XML schemas will be released.
>
>
> WORKSHOP GOALS
>
> The workshop will be tightly focussed on the following goals:
>
> 1. Standards: To revise the three proposed standards, the OLAC
>    Metadata Set, the OLAC Process document and the OLAC Protocol.
>
> 2. Vocabularies: To finalize the controlled vocabularies: linguistic
>    type, software functionality, rights, format, encoding, ...
>
> 3. Review: To give feedback to each participating archive on its use
>    of metadata, to review the services on the OLAC and LINGUIST sites.
>
> 4. Proposals: To hear new proposals for working groups, encoding
>    schemes, implementation notes and best practice recommendations,
>    and position papers on work that still needs to be done.
>
> In support of these goals, the workshop will consist of:
> * group discussions, both plenary and in parallel working groups;
> * review/editing of documents, both in working groups and in private;
> * plus a limited number of presentations (cf goal 4).
>
> NB. No time will be allocated for project reports in the formal program.
>
>
> PARTICIPATION
>
> The workshop is open to advisory board members and representatives of
> participating archives, consistent with our core value of "Empowering
> the Players" [http://www.language-archives.org/OLAC/process.html].
>
> *** Please communicate your intention to participate by October 1.
>
> NB. If you have been thinking about becoming an OLAC data provider, now
> would be a good time to act. Any archive that becomes a data provider
> by October 1 will also be invited to participate in this foundation
> setting workshop.  For more information on becoming a data provider,
> please see http://www.language-archives.org/docs/implement.html
>
>
> SPONSORSHIP
>
> The workshop is being sponsored by the NSF ISLE project "International
> Standards in Language Engineering".  We have funding for accommodation
> at the University Sheraton, a short walk from IRCS.  No registration
> fee will be charged.  Some travel support may also be available.
>
>
> PREPARATORY TASKS
>
> In order to ensure that the workshop achieves its goals, participants
> will be expected to help create, review and edit draft documents ahead
> of the meeting.  We would like each person to contribute 1-2 days
> each month to this effort from September onwards.  The preparatory tasks
> correspond to our workshop goals, and are as follows:
>
> 1. Standards: review all the standards documents and suggest revisions
>
> 2. Vocabularies: review some of the controlled vocabularies and
>    suggest revisions
>
> 3. Review: choose three participating archives besides your own and
>    suggest improvements to their use of metadata; review the
>    www.language-archives.org site and the www.linguistlist.org/olac/
>    service and suggest improvements.
>
> 4. Proposals: draft an encoding scheme, an implementation note, a
>    best practice recommendation, or a proposal for anything else that
>    needs to be done, and present it to the group.
>
> The success of the workshop will depend on active participation in
> these tasks.  Comments circulated in advance will have the most impact
> on our work.  To facilitate the process we will use this list,
> OLAC-Implementers, except where formal working groups have already
> been established with their own lists.  Note that OLAC-Implementers is
> an open, unmoderated list, archived on the LINGUIST site at:
> http://lists.linguistlist.org/archives/olac-implementers.html
>
> More information will be circulated in September.  In the meantime,
> please feel free to get started on any of the above tasks...
>
> Steven Bird & Gary Simons
>
>


From sb at CS.MU.OZ.AU  Thu Oct  3 01:38:48 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Wed, 2 Oct 2002 21:38:48 EDT
Subject: Some comments on the LINGUIST service provider
Message-ID: <WED.2.OCT.2002.213848.EDT.SB@CS.MU.OZ.AU>

One of the workshop preparatory tasks is:

> 3. Review: choose three participating archives besides your own and
>    suggest improvements to their use of metadata; review the
>    www.language-archives.org site and the www.linguistlist.org/olac/
>    service and suggest improvements.

I have three low-level comments on the LINGUIST service provider.  I hope
this feedback will make the service even better than it already is...

a) The first page you come to is a long document with a search form some
way down.

I'd favor a very simple page (cf www.google.com) consisting of a search
box, a link to the advanced search, and a link to "more about OLAC" which
has all the original text.

b) Users wanting "more powerful search" are directed to the "OLAC Query
page".  (Weren't we just on an OLAC query page?)  Arriving on this new page,
we see that it is called "OLAC Query Form: Simple Search".  This is
confusing, since we've just come from a simple search page expecting the
more powerful search page, only to find that this is still only simple
search.  There's no pointer back to the really simple search.

I'd prefer this to be called "Advanced Search" (both on the title and the
incoming link), with a backpointer to the simple search.

c) This second page points to yet another page, called Advanced Search.
However, this generates an error: "ODBC Error Code = S1000 (General error)
[TCX][MyODBC]Table 'OLAC.alltypes' doesn't exist".  I expect this really
advanced search permits search on all fields.

I'm not convinced we need three levels of search.  Could the second and
third levels be collapsed into a single level, containing all the search
fields?

Does anyone else have comments on this service?

-Steven

--
Steven Bird        Email: <sb at cs.mu.oz.au>  Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania


From baden at COMPULING.NET  Thu Oct  3 10:24:37 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 20:24:37 +1000
Subject: Some comments on the LINGUIST service provider
In-Reply-To: <200210030138.g931cmM07394@unagi.cis.upenn.edu>
Message-ID: <THU.3.OCT.2002.202437.1000.>

From dealing with some new end users who have been introduced to OLAC
via the Linguist interface, I've got a couple of related comments.

Users would like to have a simple search - by title, author, description
and subject language. This would mean author would be added to the
existing Quick Search.

There is a difference between the number of archives actively searched
on the LL site and those registered at the OLAC site. I would have
assumed automated harvesting of the new archives as they are registered
at either location?


An ultra-low level comment, when you click on the link at the bottom of
the LinguistList OLAC page:

"If you would like to help with the OLAC enterprise, please let us know!

Thank you in advance for your help!  "

An email message is launched, but there's no email address to send
things to (ie mailto: is malformed).

Baden


From baden at COMPULING.NET  Thu Oct  3 10:28:59 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 20:28:59 +1000
Subject: OLAC resources
Message-ID: <THU.3.OCT.2002.202859.1000.>

FWIW, the format.cpu, format.os and format.sourcecode schemas are
available at http://www.compuling.net/projects/olac/ along with some
other OLAC resources under development.


Baden


From baden at COMPULING.NET  Thu Oct  3 12:14:29 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 22:14:29 +1000
Subject: experimental schema: format.sourcestatus
In-Reply-To: <200209162213.g8GMDGL02117@unagi.cis.upenn.edu>
Message-ID: <THU.3.OCT.2002.221429.1000.>

Earlier I wrote to this list describing a problem I had found with the
format.* schemas, in that they did not necessarily describe a certain
aspect of a software resource.

I believe retaining the format.cpu, format.os and format.sourcecode
vocabularies is beneficial. However, I would like to propose a new
addition to these, namely a schema for "format.sourcestatus", which
would be an optional controlled vocabulary, considered experimental only
at this stage.

The purpose of format.sourcestatus is to address two needs identified by
end users as critical to being able to evaluate a piece of software and
determine its degree of utility to their own circumstances, eloquently
expressed by Steven Bird as:

> the end-user requirement here is to be able to answer the
> question: "Can I run this software?"

and

> the end-user requirement here is to be able to answer the
> question: "How much effort will be required to get this running?"

In addressing these questions, format.sourcestatus is a controlled
vocabulary that provides a range of descriptive options which assist the
user in identifying whether or not they can use the software resource in
question, and what additional requirements there will be to make it
work.

format.sourcestatus will contain enumeration values like the following:

	Pre-Compiled Binary
	Requires Compilation
	Requires Make
	Wrapped Installation
	Script

There is a rudimentary draft of this available at:

http://www.compuling.net/projects/olac/031002-draft-olac-format.sourcestatus.xsd

It also occurs to me that format.sourcecode may not be the best name for
the controlled vocabulary. In essence, the identification performed by
this schema is of the language in which sourcecode is written.

Any comments?

Baden


From sb at CS.MU.OZ.AU  Thu Oct  3 22:38:05 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 3 Oct 2002 18:38:05 EDT
Subject: Some comments on the LINGUIST service provider
In-Reply-To: Your mail dated Sunday 3 November, 2002.
Message-ID: <THU.3.OCT.2002.183805.EDT.SB@CS.MU.OZ.AU>

Helen Aristar Dry wrote:
> But he suggests having a search blank, plus a full search.  I guess I
> just need to think about whether there's some way to do both what he
> suggests and what you suggest.

Would this work: a simple search page with a single keyword search field,
and an advanced search page in which the most salient fields (e.g. Baden's
list) appeared at the top?  Further fields could be separated off from the
main ones and/or be given in smaller type.

Steven Bird


From hdry at LINGUISTLIST.ORG  Thu Oct  3 23:12:17 2002
From: hdry at LINGUISTLIST.ORG (Helen Aristar Dry)
Date: Thu, 3 Oct 2002 19:12:17 -0400
Subject: Some comments on the LINGUIST service provider
In-Reply-To: <200210032238.g93Mc6M09019@unagi.cis.upenn.edu>
Message-ID: <THU.3.OCT.2002.191217.0400.>

Good idea, Steven.  Thanks.  -Helen

Date sent:      	Thu, 3 Oct 2002 18:38:05 EDT
Send reply to:  	Steven Bird <sb at cs.mu.oz.au>
From:           	Steven Bird <sb at CS.MU.OZ.AU>
Organization:   	University of Melbourne
Subject:        	Re: Some comments on the LINGUIST service provider
To:             	OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG

> Helen Aristar Dry wrote:
> > But he suggests having a search blank, plus a full search.  I guess I
> > just need to think about whether there's some way to do both what he
> > suggests and what you suggest.
>
> Would this work: a simple search page with a single keyword search field,
> and an advanced search page in which the most salient fields (e.g. Baden's
> list) appeared at the top?  Further fields could be separated off from the
> main ones and/or be given in smaller type.
>
> Steven Bird


From baden at COMPULING.NET  Fri Oct  4 13:55:00 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 4 Oct 2002 23:55:00 +1000
Subject: experimental schema:type.functionality
Message-ID: <FRI.4.OCT.2002.235500.1000.>

The purpose of type.functionality is to describe the functionality of a
software resource.

There is a rudimentary draft of this available at:

http://www.compuling.net/projects/olac/041002-draft-olac-type.functionality.xsd

This is based on the categorization from the HLT Survey at
http://cslu.cse.ogi.edu/HLTsurvey/

Baden


From sb at CS.MU.OZ.AU  Mon Oct  7 02:34:22 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Sun, 6 Oct 2002 22:34:22 EDT
Subject: experimental schema: format.sourcestatus
In-Reply-To: Your mail dated Thursday 3 October, 2002.
Message-ID: <SUN.6.OCT.2002.223422.EDT.SB@CS.MU.OZ.AU>

Last week Baden Hughes presented a new encoding scheme called source
status.  Here are some initial comments:

> Pre-Compiled Binary

or just "binary"?

> Requires Compilation
> Requires Make
> Wrapped Installation

These three are closely related - a build is required, and the
difference is in how much work the person has to do.

> Script

So a simple starting point here would be to have a three-way
distinction between binary, interpreted and compiled.
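
The grouping suggested above can be written down as a lookup table. This is only a sketch of the comments in this message: the enum name and the mapping of the five draft values onto three categories are my reading of the discussion, not an agreed vocabulary.

```python
from enum import Enum


class CoarseStatus(Enum):
    """Hypothetical three-way distinction for a software distribution."""
    BINARY = "binary"
    INTERPRETED = "interpreted"
    COMPILED = "compiled"


# The five draft enumeration values, grouped as discussed above.
SOURCESTATUS_TO_COARSE = {
    "Pre-Compiled Binary": CoarseStatus.BINARY,
    # These three all require a build; they differ in how much work it is.
    "Requires Compilation": CoarseStatus.COMPILED,
    "Requires Make": CoarseStatus.COMPILED,
    "Wrapped Installation": CoarseStatus.COMPILED,
    "Script": CoarseStatus.INTERPRETED,
}


def needs_build(sourcestatus: str) -> bool:
    """True when the user must build the software before running it."""
    return SOURCESTATUS_TO_COARSE[sourcestatus] is CoarseStatus.COMPILED
```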

[Aside: In all three cases, other packages may need to be downloaded,
built and installed before the software can be run, and these will
need to be documented using the relation.requires element/refinement.
Presumably we won't bother specifying that a C compiler is required
for a resource that is specified as being in the C language, unless a
particular compiler/version is required.]

Notice that the distinction between interpreted and compiled is
largely predictable from the source language, and that the source code
might not actually be provided.  Therefore, we want to focus not on the
source code, but the nature of the distribution (format.distribution?).
Obviously, this now applies to data as well as software, since data can
come in binary or source forms, with or without wrapping.

The distribution methods include archives (tar, zip, rpm) which may be
compressed, and may be self-extracting or require other software.  The
self-extracting kind might actually manage the download and
registration process, as in the case of the CSLU toolkit.  To some
extent, the distribution method is predictable from the MIME type of
the file, which weakens the case for special treatment of distribution
types.

An orthogonal issue is size: can I download this over a modem line?

Anyway, to move things forward here, we may need to do some more study of
end-user needs.

-Steven

--
Steven Bird        Email: <sb at cs.mu.oz.au>  Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania


From sb at CS.MU.OZ.AU  Fri Oct 18 02:11:48 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 17 Oct 2002 22:11:48 EDT
Subject: Local arrangements in Philadelphia
Message-ID: <THU.17.OCT.2002.221148.EDT.SB@CS.MU.OZ.AU>

Folks,

I have now set up a website for the workshop at:
http://www.language-archives.org/events/olac02/

The most important information it contains now is the list of confirmed
participants and the arrangements for booking your hotel room.  Note that
we are paying for hotel rooms for the confirmed participants (except
local participants).

Please call the hotel to make your booking, using one of the numbers
on the website.  Please contact Laurel Sweeney at Penn if you encounter any
problems with the booking process.

Information about the workshop program will be posted next week.

Others who wish to attend need to contact me as soon as possible please.

Thanks,
-Steven

--
Steven Bird        Email: <sb at cs.mu.oz.au>  Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania


From sb at CS.MU.OZ.AU  Wed Oct 23 10:26:21 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Wed, 23 Oct 2002 06:26:21 EDT
Subject: workshop program
Message-ID: <WED.23.OCT.2002.062621.EDT.SB@CS.MU.OZ.AU>

Folks,

I'm sorry that the workshop program is long overdue.  There is a lot to
cover, and Gary and I would like to solicit your input on priorities, and
on the contributions of each participant.

We think the top level goals are:
1. to effect the transition to the operational phase of OLAC
2. to set the agenda for the coming year
3. to foster ongoing collaboration amongst the participants
   in the pursuit of the above

In support of these goals, the primary workshop activities need to be:
1. presenting and reviewing all the standards, understanding the
   implementation issues, and releasing version 1.0
2. finalizing, testing and documenting key recommendations - the metadata vocabularies
3. evaluating the community infrastructure - website, services, documentation

Here then is a comprehensive overview of the OLAC infrastructure, both
existing and planned, along with various suggestions about what we
could accomplish before/during the workshop, and who could possibly take
the lead in doing or delegating the work.  There is a lot here, but many
items can be dispensed with quickly (e.g. a 10 minute report), while some
big things that are beyond the scope of our workshop can be put on the
agenda of a working group for 2003.  I hope that the work will be shared
around, so that everyone has significant activities to do in the remaining
six weeks.

So please suggest priorities, identify any omissions, and volunteer to work
on something.  I'll convert this into a provisional program by the start of
next week.

Thanks,
-Steven

----

Annotations:
feedback: feedback requested before workshop
overview: a short presentation (10 minutes)
presentation: full presentation (20-30 minutes)
wg: working group(s) will process this


1. STANDARDS (Tuesday)

All of these need to be presented on day 1 (even if briefly) to make
sure there is enough time for feedback and consensus building if any
issues do arise.

a) OLAC-Process [feedback, overview] - Gary Simons?
   * present and discuss at start of workshop because it
     defines how we will operate even during the workshop

b) OLAC-PMH [overview, wg] - Steven Bird?
   * the primary issue will be the transition from OAI 1.1 to 2.0
   * those who implement data providers to discuss

c) OLAC Metadata Format [feedback, presentation, wg] - Steven Bird?
   * new work on representing OLAC metadata in XML
   * more information will be circulated this week
   * those who implement data providers to discuss

d) OLAC Metadata Extension Mechanism [presentation, wg] - Steven Bird?
   * how to express a vocabulary in a harvestable schema fragment
   * those who implement 3rd party extensions to discuss


2. RECOMMENDATIONS (Tuesday/Wednesday)

These are our vocabularies, along with any new proposals for recommendations
(e.g. best practices for digitizing audio recordings).

a) OLAC-Language [overview] - Gary Simons?, Anthony Aristar?

b) OLAC-Linguistic-Type [feedback, overview, wg?] - Heidi Johnson?, Helen Aristar Dry?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository
   * the working group meeting may not be necessary

c) OLAC-Linguistic-Fields [feedback, overview] - Helen Aristar Dry?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository

d) OLAC-Role: [feedback, overview, wg] - Heidi Johnson?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository
   * still need to consider roles in the creation of language technologies
     and corpus publications

e) OLAC-Rights: [feedback, overview, wg] - Heidi Johnson?, Steven Bird?

Other vocabularies to consider: OLAC-Encoding, OLAC-Format, OLAC-Functionality.
Time to be given to testing the vocabularies on existing repositories.


3. ARCHIVES AND SERVICES (Wednesday)

a) review metadata quality for existing archives [feedback]

b) OLAC website [feedback]

c) Registration [overview] - Gary Simons?

d) Vida/ORE/ORyX/OLACA/Viser [overview]
   * need to identify developers to help in 2003

e) LINGUIST [feedback, overview] - Helen Aristar Dry?, Anthony Aristar?


4. SUB-COMMUNITY EXTENSIONS (Wednesday)

a) Language technology [feedback, overview, wg] - Baden Hughes?
   * vocabulary documents to be circulated before the workshop
   * work on vocabularies for OS, CPU, Sourcecode, Distribution

b) Language documentation [overview, wg] - Heidi Johnson?
   * IMDI/OLAC mapping?
   * possible common vocabularies across IMDI and OLAC


5. IMPLEMENTATION NOTES (Wednesday/Thursday)

Useful tools that people have developed:

- exporting MS Access to ORyX files for Net-DC - Andrew Cole?
- Net-DC experience - Khalid Choukri?
- AILLA database model - Erik Grostic?


6. AGENDA FOR 2003 (Thursday)

a) more best practices
   * there are many areas where we need best practice recommendations
     [http://www.ldc.upenn.edu/sb/home/publications.html#0204020]
   * who wants to pick a need and start working on a recommendation?

b) more data providers
   * outreach, special needs, help with data providers
   * many subcommunities are creating resources
   * who wants to commit to helping them hook up with OLAC?
   + linguistics - accessible OLAC introduction - Jeff Good?
   + language technology
   + national archives
   + text archives
   + museum archives (e.g. 19C fieldwork materials)
   + antiquity (e.g. classical and ancient Near East text collections)
   + others?

c) more service providers
   * regional services (e.g. Asia)
   * services tailored for research needs (e.g. typology)

d) proposals for other work that needs to be done

--end--


From sb at CS.MU.OZ.AU  Thu Oct 31 06:41:07 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 31 Oct 2002 01:41:07 EST
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <THU.31.OCT.2002.014107.EST.SB@CS.MU.OZ.AU>

About six weeks ago, Gary Simons and I presented a schematic outline
for a new representation for OLAC metadata.  We described a single
extension mechanism that would provide better interoperability and
extensibility, with less administrative and technical infrastructure
than before, with the goal of making it still easier for archives to
participate in OLAC.

About the same time we discovered very recent DCMI work on the XML
representation of DC and DC qualifiers:

  Guidelines for implementing Dublin Core in XML
  http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/

  Recommendations for XML Schema for Qualified Dublin Core
  http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/20021007/

These documents finally provide the DC XML framework that we had hoped
to find way back in January 2001, when we first started working on an
XML representation of our own Dublin Core qualifiers.

In the intervening six weeks we have figured out a new format for OLAC
metadata which implements our simplified extension mechanism, while
simultaneously re-using the new schemas from the DCMI.


REVIEW

To recap briefly, here are three examples showing OLAC 0.4 metadata,
the version in current use:

  <subject.language code="x-sil-BAN">Dschang</subject.language>
  <language scheme="AS-Formosan">Seediq</language>
  <contributor refine="editor">Sapir, Ned</contributor>

The examples illustrate several points:
(a) Element refinement: subject.language, editor (i.e. two different methods)
(b) OLAC encoding scheme: code="xxx"
(c) Free text element content, the escape hatch when OLAC codes don't fit
(d) A third party encoding scheme: scheme="xxx"

Here's the same information represented according to last month's
proposal for a simplified extension mechanism:

  <subject extension="OLAC-Language" code="x-sil-BAN">Dschang</subject>
  <language extension="AS-Formosan" code="Seediq"/>
  <contributor extension="OLAC-Role" code="editor">Sapir, Ned</contributor>

According to our proposal, this extension attribute would be used to
express all refinements, vocabularies and schemes, whether originating
from OLAC, an OLAC subcommunity, or an individual archive.  These
extensions wouldn't be centrally controlled, so individual archives
and groups of archives could develop their own extensions without any
community-wide approval process, and later demonstrate useful services
based on their extension in order to promote it to the community at
large.


REVISED REPRESENTATION

In the revised representation we are now proposing, the "extension"
attribute is renamed "xsi:type", and its value is given a namespace
prefix.  For example, the above three elements would be rewritten as
follows:

  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
  <language xsi:type="as:formosan" code="Seediq"/>
  <contributor xsi:type="olac:role" code="editor">Sapir, Ned</contributor>

This little change brings us into line with DCMI.  No longer do we
have to define DC and DC qualifiers ourselves; we can now simply
import the DCMI Schemas directly.  This means that OLAC metadata is
not simply a semantic extension of DC metadata as in the past, but the
OLAC metadata *format* is a *syntactic* extension of the DC metadata
format.
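
The renaming can be sketched in a few lines. This is an illustration only: the mapping table simply pairs the three extension names from last month's examples with the prefixed names used above, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("xsi", XSI)

# Illustrative mapping taken from the three examples in this message.
EXTENSION_TO_TYPE = {
    "OLAC-Language": "olac:language",
    "AS-Formosan": "as:formosan",
    "OLAC-Role": "olac:role",
}


def rewrite(elem: ET.Element) -> ET.Element:
    """Replace an "extension" attribute with the equivalent xsi:type."""
    ext = elem.attrib.pop("extension", None)
    if ext is not None:
        elem.set(f"{{{XSI}}}type", EXTENSION_TO_TYPE[ext])
    return elem
```

The code attribute and the element content are left exactly as they were; only the attribute naming the vocabulary changes.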


THE FILES

The schemas are posted at:
http://www.language-archives.org/OLAC/1.0b1/

The contents of the directory are as follows:

1. Example metadata record
* olac.xml

2. Top level OLAC schema
* olac.xsd

3. OLAC vocabularies (subject to approval at the December workshop)
* olac-date.xsd
* olac-language.xsd
* olac-linguistic-field.xsd
* olac-linguistic-type.xsd
* olac-role.xsd

4. Hypothetical third-party extensions (to be hosted off-site)

a) Academia Sinica Formosan language vocabulary
* third-party/as-formosan.xml
* third-party/as-formosan.xsd

b) LT-World Human Language Technology vocabulary
* third-party/ltworld-hlt-field.xml
* third-party/ltworld-hlt-field.xsd

c) Individual archive's own redefined OLAC vocabularies
* third-party/myolac.xml
* third-party/myolac.xsd

d) Networking Data Centers' vocabulary (LDC/ELRA)
* third-party/netdc.xml
* third-party/netdc.xsd

e) Software vocabularies
* third-party/software.xml
* third-party/software-cpu.xsd
* third-party/software-os.xsd
* third-party/software-sourcecode.xsd
* third-party/software.xsd

f) An example mixing three independent extensions
* third-party/combined.xml


TECHNICAL DISCUSSION

(a) About xsi:type

The xsi:type attribute is defined in the XML Schema standard. It is a
directive to a schema validator, telling it to override the definition
of the XML element with the named type definition. It uses the
namespace declaration to find the schema fragment that defines the
overriding type.  Thus, the attribute xsi:type="olac:language" says:
"take the DC definition of subject, add an optional 'code' attribute,
and restrict the code values to the range specified in the schema for
olac:language".
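
To make the namespace lookup concrete, here is a rough sketch of how software other than a validator might resolve the prefix in an xsi:type value. The olac namespace URI in the sample is a placeholder, not the real one, and the sketch ignores prefix scoping (a prefix is assumed to keep one meaning for the whole record).

```python
import io
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Sample record; the olac namespace URI is a made-up placeholder.
SAMPLE = """<record xmlns="http://purl.org/dc/elements/1.1/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:olac="http://example.org/olac/">
  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
</record>"""


def resolve_types(xml_text: str):
    """Return (element tag, namespace URI, type name) for each xsi:type."""
    prefixes = {}
    results = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            # payload is a (prefix, uri) pair for a namespace declaration.
            prefix, uri = payload
            prefixes[prefix] = uri
        else:
            # payload is the element; its attributes are available at "start".
            qname = payload.get(XSI_TYPE)
            if qname and ":" in qname:
                prefix, local = qname.split(":", 1)
                results.append((payload.tag, prefixes.get(prefix), local))
    return results
```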

(b) Harvesting

When harvesting these records, OLAC service providers will store OLAC
and third-party metadata elements in the same way, using columns for
the extension name (i.e. the value of the xsi:type attribute), for the
code, and for the element content.  In this way, coded values and
element content will be searchable for both OLAC and third-party
vocabularies alike.  However, only OLAC vocabularies would have
special services associated with them (e.g. the language codes service
built into the LINGUIST service provider).  The proposer of a new
extension could set up their own service provider to demonstrate the
value of their vocabulary in resource discovery and promote it to the
whole OLAC community.
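
A minimal sketch of this storage model follows, assuming an in-memory SQLite table. The table name, column names, "record" wrapper and helper functions are hypothetical and are not taken from any existing service provider.

```python
import sqlite3
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical record holding the three example elements.
RECORD = """<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <subject xsi:type="olac:language" code="x-sil-BAN">Dschang</subject>
  <language xsi:type="as:formosan" code="Seediq"/>
  <contributor xsi:type="olac:role" code="editor">Sapir, Ned</contributor>
</record>"""


def harvest(record_xml, db):
    """Store one row per element: name, extension, code, content."""
    rows = [
        (el.tag.split("}")[-1], el.get(XSI_TYPE), el.get("code"),
         (el.text or "").strip())
        for el in ET.fromstring(record_xml)
    ]
    db.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)


def find_by_code(db, extension, code):
    """Search on a coded value; OLAC and third-party rows look the same."""
    query = ("SELECT element, content FROM metadata "
             "WHERE extension = ? AND code = ?")
    return db.execute(query, (extension, code)).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata "
           "(element TEXT, extension TEXT, code TEXT, content TEXT)")
harvest(RECORD, db)
```

Because the extension name is just another column, a query on "as:formosan" needs no more machinery than a query on "olac:role".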

(c) Dumb-down

Dumb-down from a third-party extension to OLAC, and dumb-down from
OLAC to DC, are straightforward to implement in this model.  Full
details will be circulated in a later message.

(d) Application profiles

An "application profile" is a hybrid metadata record that combines
elements and attributes that come from multiple authorities [1,2].
Under the newly proposed approach, we can conceive of OLAC metadata as
an application profile for the language resources community.
When a third party wants to extend the OLAC application profile, they
are actually creating a new application profile that combines DC and OLAC
metadata elements and attributes, along with their own.

[1] http://www.ariadne.ac.uk/issue25/app-profiles/
[2] http://dublincore.org/documents/library-application-profile/

(e) Copying the DCMI use of XML schemas

The decision to copy the DCMI's use of XML Schemas has two unfortunate
and unavoidable consequences.  First, the XML representation of DC and
OLAC metadata is tied to XML Schema validation.  If the validation
technology is ever changed, then the metadata format will need to be
changed.  Second, the xsi:type declarations are not constrained as to
which DC element they appear on.  If a metadata record used the role
vocabulary on an inappropriate element such as title, then the schema
validation would not report this error.

These are problems with the implementation decisions made by the
DC-Architecture Working Group, problems that we inherit.  We feel that
it is more important to conform to the DCMI and work with them to
address these issues than to continue working in isolation.
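
Since schema validation will not catch a vocabulary used on the wrong
element, a data provider or harvester could run a separate check.  The
following Python sketch shows the idea.  The ALLOWED table is an
assumption made for illustration, not an official OLAC list, and the
record wrapper and the "author" code are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Assumed mapping from extension name to the DC elements it may appear on.
# This table is illustrative only; it is not taken from any OLAC document.
ALLOWED = {
    "olac:role": {"creator", "contributor"},
    "olac:linguistic-field": {"subject"},
}

# Hypothetical record with the role vocabulary wrongly used on title.
RECORD = """\
<record xmlns="http://purl.org/dc/elements/1.1/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <title xsi:type="olac:role" code="author">A grammar</title>
  <creator xsi:type="olac:role" code="author">Smith, J.</creator>
  <subject xsi:type="olac:linguistic-field" code="phonology">prosody</subject>
</record>
"""

def misplaced_types(xml_text):
    """List (element, extension) pairs where the extension is on a wrong element."""
    problems = []
    for elem in ET.fromstring(xml_text):
        extension = elem.get(XSI_TYPE)
        tag = elem.tag.split("}", 1)[-1]   # drop the "{namespace}" part of the tag
        # Extensions missing from ALLOWED are left alone: we know nothing about them.
        if extension in ALLOWED and tag not in ALLOWED[extension]:
            problems.append((tag, extension))
    return problems

PROBLEMS = misplaced_types(RECORD)
print(PROBLEMS)
```

A check like this sits outside the schema, so it remains valid whatever
the DCMI eventually decides about constraining xsi:type.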

(f) Preserving a simple migration path

The new proposal maintains the simple migration path that is currently
permitted with OLAC 0.4.  This is an important feature for new
archives coming in to OLAC.  The following sequence illustrates the
migration path:

Step 1: archive maps their topic descriptor to the DC subject element:
  <subject>prosody</subject>

Step 2: archive uses the OLAC extension as a refinement, to state that
  the element content pertains to a linguistic field:
  <subject xsi:type="olac:linguistic-field">prosody</subject>

Step 3a: archive identifies the nearest OLAC code but retains
  their own data as a comment, to provide additional information:
  <subject xsi:type="olac:linguistic-field" code="phonology">prosody</subject>

OR
Step 3b: archive persuades community to accept a new vocabulary item:
  <subject xsi:type="olac:linguistic-field" code="prosody"/>

Note that step 3a illustrates an escape hatch for archives that have a
problem mapping their descriptors to OLAC vocabulary items.

Note also that this approach represents a minor deviation from the
DCMI approach, which puts coded values in the element content, leaving
no room for comments.
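
The four forms in the migration path differ only in which of xsi:type,
code, and element content are present, so software can tell them apart
mechanically.  Here is a rough Python sketch; the function name and the
stage labels are ours, chosen for illustration.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
XSI_DECL = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

def migration_stage(subject_xml):
    """Classify a single element by which step of the migration path it matches."""
    elem = ET.fromstring(subject_xml)
    has_type = elem.get(XSI_TYPE) is not None
    has_code = elem.get("code") is not None
    has_text = bool((elem.text or "").strip())
    if not has_type:
        return "step 1: plain DC"
    if not has_code:
        return "step 2: refined, uncoded"
    if has_text:
        return "step 3a: coded, with comment"
    return "step 3b: coded only"

# The four examples from the migration path; the xsi namespace declaration
# is added so that each fragment parses on its own.
EXAMPLES = [
    '<subject>prosody</subject>',
    '<subject %s xsi:type="olac:linguistic-field">prosody</subject>' % XSI_DECL,
    '<subject %s xsi:type="olac:linguistic-field" code="phonology">prosody</subject>' % XSI_DECL,
    '<subject %s xsi:type="olac:linguistic-field" code="prosody"/>' % XSI_DECL,
]

STAGES = [migration_stage(x) for x in EXAMPLES]
for stage in STAGES:
    print(stage)
```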


CONCLUSION

The revised proposal differs minimally from the previous proposal: the
"extension" element is renamed "xsi:type".

We believe this proposal represents a significant improvement on the
current OLAC 0.4 format in the areas of simplicity, interoperability
and extensibility.  Furthermore, it puts us squarely in the DC
community: OLAC won't have to reimplement each new DC Qualifier that
the DCMI adopts; OLAC can benefit from any software that works on DC
metadata; and OLAC vocabularies can be easily adopted outside the OLAC
community.

With your approval, we will document this new format and bring it up
at the December workshop as the proposal for OLAC version 1.0.  Once
adopted, each OLAC archive would be required to support it in order to
participate in OLAC.

Please send any comments to the list.

Steven Bird & Gary Simons


From Gary_Simons at SIL.ORG  Thu Oct 31 13:31:44 2002
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Thu, 31 Oct 2002 07:31:44 -0600
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID: <THU.31.OCT.2002.073144.0600.OLACIMPLEMENTERS@LISTSERV.LINGUISTLIST.ORG>

On rereading our posting this morning, I realized that there is one major
feature of the new approach that we failed to mention since we were so
focused on explaining extensions to DC metadata. That issue is how the
refinements that are already defined by DCMI will work.

In OLAC 0.4, we used a "refine" attribute for the names of refinements
defined in the Qualified DC recommendation. We made this up in the absence
of any recommendation from DCMI as to how this should be implemented. If
you look up the new DCMI documents referenced in the main posting, you will
see that they have now addressed this issue, and their solution is to treat
the refinements as tags in their own right, but they are from the "dcterms"
namespace, rather than the "dc" namespace.

Thus, this from OLAC 0.4:

   <title>Original title</title>
   <title refine="alternative">Translated title</title>

would be the following in OLAC 1.0:

   <dc:title>Original title</dc:title>
   <dcterms:alternative>Translated title</dcterms:alternative>

N.B. Since our new solution is an application profile, the 15 main metadata
tags (like title in this example) are in the Dublin Core namespace rather
than our own.  In the examples that Steven has posted in the /OLAC/1.0b1/
directory, the Dublin Core namespace is declared to be the default
namespace, so that the above is actually expressed as:

   <title>Original title</title>
   <dcterms:alternative>Translated title</dcterms:alternative>

Anyway, I thought I should point out this difference between OLAC 0.4 and
the proposed 1.0 since it, too, will have an impact on your implementation
of data providers.
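
For implementers who want to convert existing records, the change can
be sketched in a few lines of Python.  This is only an illustration: it
assumes an un-namespaced 0.4-style input inside a made-up <record>
wrapper, handles only the "refine" attribute, and ignores every other
attribute.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Ask ElementTree to use the customary prefixes when serializing.
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)

# Simplified 0.4-style input; the <record> wrapper is a hypothetical container.
OLD_RECORD = """\
<record>
  <title>Original title</title>
  <title refine="alternative">Translated title</title>
</record>
"""

def upgrade_refinements(record_04):
    """Turn refine="X" elements into dcterms:X; put all other elements in dc."""
    old = ET.fromstring(record_04)
    new = ET.Element("record")
    for elem in old:
        refine = elem.get("refine")
        if refine is None:
            tag = "{%s}%s" % (DC, elem.tag)      # unrefined: same name, dc namespace
        else:
            tag = "{%s}%s" % (DCTERMS, refine)   # refined: the refinement becomes the tag
        child = ET.SubElement(new, tag)
        child.text = elem.text
    return new

UPGRADED = upgrade_refinements(OLD_RECORD)
print(ET.tostring(UPGRADED, encoding="unicode"))
```

The sketch writes explicit dc: prefixes; declaring Dublin Core as the
default namespace instead, as in the posted examples, changes only the
serialization and not the meaning.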

-Gary Simons