From baden at COMPULING.NET Mon Sep 16 13:15:50 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 16 Sep 2002 23:15:50 +1000
Subject: query about format.sourcecode
Message-ID:

Hi - I've got a query about matters related to the element format.sourcecode.

Currently the spec at http://www.language-archives.org/OLAC/olacms.html assumes that software resources indexed by OLAC will be in source code (and hence appropriate entries will be made under this tagset). The recommendation is currently:

   <format.sourcecode code="PROGRAMMING_LANGUAGE">Comments</format.sourcecode>

There are several questions I have about this.

1) Do we need to clarify this even further, as there are apparently two distinct options in the archive contents I've been working with? One is where the sourcecode requires compilation; the other is where the sourcecode is essentially a script (or series of scripts). Any information about the "state" of the source code is likely to be inconsistent at best across archives, and I suspect even within a single archive. IMHO it's relatively important to the end user of the OLAC search engine what state the sourcecode is in (i.e. how applicable is this code to the platforms I have access to).

2) In the case where software resources indexed by OLAC are distributed in compiled form (i.e. not sourcecode), there's apparently not much room to encode this information either. Apart from not strictly being something which belongs in a format.sourcecode element, the recommendation I assume would be that you could standardise this again by using the comment field, but the same consistency problem arises. Again, IMHO it's relatively important to the end user of the OLAC search engine what state the sourcecode is in (i.e. can I just install and run, or is it more complex?).

These two points may not represent large issues, but if the archives you are dealing with have a lot of software, ranging from source scripts in a range of languages, through source for compilation with a range of compilers, to compiled "ready to run" applications, the granularity of this markup can be important and can greatly assist with classifying and indexing resources in an appropriate manner. Additionally, for less computer-literate end users, this distinction is very important in effectively locating a resource appropriate to their needs.

Baden

From sb at UNAGI.CIS.UPENN.EDU Mon Sep 16 21:39:54 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 16 Sep 2002 17:39:54 EDT
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID:

The OLAC metadata format provides two mechanisms for community-specific resource description. First, special refinements (metadata elements and corresponding vocabularies) support compatible description across the community. For example, the subject.language element, and the OLAC-Language vocabulary, permit all archives to identify subject language in the same manner. Second, every OLAC element permits an optional scheme attribute for use by sub-communities of OLAC. For example, the scholars at Academia Sinica can use their own naming scheme for Formosan languages and still package it up using the OLAC metadata container.

This combination of standard refinements and user-defined schemes seems to offer a reasonable balance between interoperability and extensibility. Over the past month, Gary and I have been reviewing the design of OLAC metadata and have concluded that these parallel mechanisms are unnecessary.
We think that with a *single* extension mechanism, OLAC can provide even better interoperability and extensibility. Moreover, we think this can be done with less administrative and technical infrastructure than before, making it still easier for archives to participate in OLAC.

A. THE PRESENT SITUATION

We begin with a quick review of how the two existing mechanisms work in OLAC metadata. First, community-specific refinements are represented using Dublin Core qualifications represented in XML. Here is an example for subject language, for a resource about the Sikaiana language:

   <subject.language code="x-sil-SIK"/>

This refinement permits focussed searching and better precision/recall than the corresponding Dublin Core element:

   <subject>The Sikaiana Language</subject>

The OLAC version is flexible in that the code attribute is optional and that free text can be put in the element content.

The second mechanism is for user-defined schemes. All OLAC elements permit a scheme attribute, naming some third-party format or vocabulary that one or more OLAC archives use. For instance, the language listed by Ethnologue as Taroko (TRV) is known as Seediq in Academia Sinica, and OLAC would permit either or both of the following elements to appear in a metadata record for this language:

   <subject.language code="x-sil-TRV"/>
   <subject.language scheme="Sinica">Seediq</subject.language>

Such a resource would be discovered under either naming scheme, and Academia Sinica could provide end-user services that rewarded any archive which employed its scheme for Formosan language identification.

B. PROBLEMS WITH THE PRESENT SITUATION

There are four general problems with the present situation.

1. Finalizing standard refinements. Our track record at developing controlled vocabularies over the past year indicates that we are not going to be able to finalize all the vocabularies that the OLAC metadata standard specifies in time for launching version 1.0 after our December workshop. Even if some vocabularies are finalized by December, the discussion may be reopened any time a new kind of archive joins OLAC. However, each vocabulary revision must currently be released as a new version of the entire OLAC metadata set, an unacceptable bureaucratic obstacle.

2. The artificial distinction between refinements and schemes. It is not clear when a putative refinement is important enough to be adopted as an OLAC standard, versus a user-defined scheme. Some of the refinements we recognize at present aren't as germane to the overall enterprise as others (e.g. operating system vs subject language), and may not have enough support to be retained. Conversely, the community is sure to develop new, useful ontologies that we don't support at present, and we would need to change the OLAC metadata standard in order to accommodate them. Promoting a user-defined scheme to an OLAC standard would necessitate a change in the XML representation, generating unnecessary work for all archives that support the scheme.

3. Duplication of technical support. User-defined schemes are likely to involve controlled vocabularies, with the same needs as OLAC vocabularies with respect to validation, translation to human-readable form in service providers, and dumb-down to Dublin Core for OAI interoperability. At present, the necessary infrastructure must be created twice over, once for each of the two mechanisms.

4. Idiosyncrasies of XML Schema. XML Schema is used to define the well-formedness of OLAC records, but it is unable to express co-occurrence constraints between attribute values.
This means that we cannot have more than one vocabulary for an element, forcing us to build structure into element names and multiply the names (e.g. Format.markup, Format.cpu, Format.os, ...). It is unfortunate that such a fundamental aspect of the OLAC XML format depends on a shortcoming of a tool that we may not be using for very long.

In sum, the current model will be difficult to manage over the long term. Administratively, it encourages us to seek premature closure on issues of content description that can never be closed. Technically, it forces us to release new versions of the metadata format with each vocabulary revision, and forces us to create software infrastructure to support a mishmash of four syntactic extensions of DC.

C. A NEW APPROACH

In response to the problems outlined above, we would like to propose a new approach. The basic idea is simple: express all refinements, vocabularies and schemes using a uniform DC extension mechanism, and treat them all as recommendations instead of centrally-validated standards. The extension mechanism requires two attributes, called "extension" and "code", as shown below:

   <subject extension="language" code="x-sil-SIK"/>

It would be syntactically valid to simply use an extension in metadata without defining it. However, for extensions that will be used across the community, there must also be a formal definition that enumerates the corresponding controlled vocabulary in such a way that data providers and service providers alike can harvest the vocabulary from its definitive source. Thus another aspect of the new approach is an XML schema for the formal definition of an XDC extension.
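For concreteness, here is a minimal sketch of what such a formal extension definition might look like. The element and attribute names below are illustrative assumptions, not the actual XDC schema (the definitive example is at the XDC 0.1 URL cited below):

   <!-- Illustrative sketch only: an assumed shape for a harvestable
        extension definition; not the actual XDC 0.1 schema. -->
   <extension-definition name="language">
     <description>Identifies a language, per the OLAC-Language vocabulary.</description>
     <code value="x-sil-SIK">Sikaiana</code>
     <code value="x-sil-TRV">Taroko</code>
     <!-- one code element per controlled vocabulary item -->
   </extension-definition>

Under such an arrangement, a service provider could harvest the definition from its definitive source and validate any element carrying extension="language" against the enumerated code values.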
In the description section of the OAI Identify response, a data provider would declare which formally defined extensions it employs in its metadata. Extensions that enjoyed broad community support would be identified as OLAC Recommendations (following the existing OLAC Process). All OLAC archives would be encouraged to adopt them, in the sense that OLAC service providers would permit end-users to perform focussed searches over these extensions. In this way, archives that cooperate with the rest of the community are rewarded.

Note that the approach isn't specific to language archives, so we're calling it extensible Dublin Core (XDC). An example of the syntax is available (an XML DTD, the equivalent XML schema, and an instance document):

   http://www.language-archives.org/XDC/0.1/

D. BENEFITS

The new approach is technically simpler than the existing approach, and neatly solves the four problems we reported.

1. Finalizing standard refinements. The editors of OLAC vocabulary documents would be empowered to edit the vocabulary into the future, without concern for integration with new releases of the OLAC metadata format.

2. The artificial distinction between refinements and schemes. The syntactic distinction is gone, being replaced by a semantic one: is the vocabulary an OLAC Recommendation or not? Any archive or group of archives would be free to start using their own extensions without any formal registration. They could build a service to demonstrate the merit of their extension, thereby encouraging other archives to adopt it. Once broad support had been established, they could build a case for an OLAC Recommendation, leading to adoption across the community.

3. Duplication of technical support. With the single extension mechanism, we can provide uniform technical support for validation, translation and dumb-down.

4. Idiosyncrasies of XML Schema. We no longer give XML Schema such sway in determining our XML syntax. Other XML and database technologies will be used to test that an extension is used correctly.

In sum, the new approach is extensible, requiring no central administration of extensions, and no coordination of vocabulary revisions with new releases of the metadata format. The new approach also supports interoperability across the whole OLAC community (via OLAC Recommendations) and also among OLAC sub-communities that want to create their own special-purpose extensions.

E. IMPLICATIONS

We are still working out the technical implications for OLAC central services (e.g. registration, Vida, ORE, etc.), and we will only be able to implement parts of this in time for the December meeting. As always, we would welcome donations of programmer time to help us. The short-term implication for OLAC archives is completely trivial, since only a simple syntactic change is required.

The most important implication of this change is that it reduces the pressure to reach final agreement on OLAC vocabularies by our December workshop. But this isn't an excuse for us to slow down on that front. On the contrary, it frees us up to find working solutions for the key vocabularies that define us as a community. These will always be imperfect compromises that we can agree to work with and revise as necessary, well into the future. In sum, we hope we are not opening up a technical can of worms, but facilitating progress on the substantive issues, our common descriptive ontologies. Therefore, we encourage people to identify a particular extension that they would like to work on, and post their ideas and questions to this list (as Baden Hughes has just now done for sourcecode). You may also like to present your ideas at our workshop in December...

--

So, what do you think? Do you agree with our proposals for (i) a syntactic simplification in our XML representation, and (ii) switching OLAC vocabularies from being centrally validated standards to recommendations? We would welcome your feedback.

Steven Bird & Gary Simons

From sb at UNAGI.CIS.UPENN.EDU Mon Sep 16 22:13:15 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 16 Sep 2002 18:13:15 EDT
Subject: query about format.sourcecode
In-Reply-To: Your mail dated Monday 16 September, 2002.
Message-ID:

Baden Hughes wrote:

> I've got a query about matters related to the element format.sourcecode.

It's good to see discussion of software resources for a change, and I hope the maintainers of software archives (DFKI, TRACTOR) will contribute to this discussion.

> Currently the spec at http://www.language-archives.org/OLAC/olacms.html
> assumes that software resources indexed by OLAC will be in source code
> (and hence appropriate entries will be made under this tagset).

Not quite - all OLAC elements are optional, and some elements are simply inappropriate for some resources. Software distributed in binary form only doesn't need to be given any sourcecode descriptor.

> The recommendation is currently:
>
>    <format.sourcecode code="PROGRAMMING_LANGUAGE">Comments</format.sourcecode>
>
> There are several questions I have about this.
>
> 1) Do we need to clarify this even further, as there are apparently two
> distinct options in the archive contents I've been working with? One is
> where the sourcecode requires compilation; the other is where the
> sourcecode is essentially a script (or series of scripts).
> Any information about the "state" of the source code is likely to be
> inconsistent at best across archives, and I suspect even within a single
> archive. IMHO it's relatively important to the end user of the OLAC
> search engine what state the sourcecode is in (i.e. how applicable is
> this code to the platforms I have access to).

Good, so the end-user requirement here is to be able to answer the question: "Can I run this software?"

> 2) In the case where software resources indexed by OLAC are distributed
> in compiled form (i.e. not sourcecode), there's apparently not much room
> to encode this information either. Apart from not strictly being
> something which belongs in a format.sourcecode element, the
> recommendation I assume would be that you could standardise this again
> by using the comment field, but the same consistency problem arises.
> Again, IMHO it's relatively important to the end user of the OLAC search
> engine what state the sourcecode is in (i.e. can I just install and run,
> or is it more complex?).

Right, so the end-user requirement here is to be able to answer the question: "How much effort will be required to get this running?"

> These two points may not represent large issues, but if the archives you
> are dealing with have a lot of software, ranging from source scripts in
> a range of languages, through source for compilation with a range of
> compilers, to compiled "ready to run" applications, the granularity of
> this markup can be important and can greatly assist with classifying and
> indexing resources in an appropriate manner. Additionally, for less
> computer-literate end users, this distinction is very important in
> effectively locating a resource appropriate to their needs.

Absolutely. Currently we have vocabularies for Sourcecode, CPU, and OS. However, we can modify or scrap them if they don't serve our needs for resource description and discovery. Perhaps we need a new vocabulary that better describes the state of the sourcecode.

One way to proceed here is for Baden (and any others) to identify the full range of end-user requirements (are there more than these two?) and then propose vocabularies that best serve these requirements...

-Steven

--
Steven.Bird at ldc.upenn.edu  http://www.ldc.upenn.edu/sb
Assoc Director, LDC; Adj Assoc Prof, CIS & Linguistics
Linguistic Data Consortium, University of Pennsylvania
3600 Market St, Suite 810, Philadelphia, PA 19104-2653

From baden at COMPULING.NET Fri Sep 20 11:57:22 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 20 Sep 2002 21:57:22 +1000
Subject: proposed revision of format.os
Message-ID:

In working with several archives and drawing on other IT experience, I'd like to make some proposed changes to the format.os schema.

---
1.0 OLAC Schema for operating system types, Steven Bird, 4/27/01
1.1 draft OLAC Schema for operating system types, Baden Hughes, 19/09/02
---

You can also find this draft schema at http://www.compuling.net/projects/olac/190902-draft-olac-format.os.xsd

These changes essentially add to the list of possible operating systems that I've encountered in classifying software. If preferred, I can circulate it to the list. If there are others interested in working on this document, I'm more than happy to collaborate.

Baden
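For readers who have not opened the drafts, the vocabulary schemas are plain XSD enumerations, so revising a vocabulary amounts to editing a list. Below is a minimal sketch in that style; the handful of values shown here is illustrative and is not Baden's actual draft list:

   <!-- Illustrative sketch in the style of the OLAC vocabulary schemas;
        the enumeration values are examples, not the full draft. -->
   <xsd:simpleType name="OLAC-OS">
     <xsd:restriction base="xsd:string">
       <xsd:enumeration value="UNIX"/>
       <xsd:enumeration value="LINUX"/>
       <xsd:enumeration value="MACOS"/>
       <xsd:enumeration value="MSWINDOWS"/>
       <xsd:enumeration value="PALMOS"/>
     </xsd:restriction>
   </xsd:simpleType>

Adding a newly encountered operating system is then a one-line change to the enumeration.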
From baden at COMPULING.NET Fri Sep 20 12:15:53 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 20 Sep 2002 22:15:53 +1000
Subject: proposed revision of format.cpu
Message-ID:

In working with several archives and drawing on other IT experience, I'd like to make some proposed changes to the format.cpu schema (without regurgitating the entire history of computing in the process :-).

---
1.0 OLAC Schema for CPUs, Steven Bird, 5/7/01
1.1 draft OLAC Schema for CPU, Baden Hughes, 19/09/02
---

You can also find this draft schema at http://www.compuling.net/projects/olac/190902-draft-olac-format.cpu.xsd

These changes essentially add to the list of possible CPU architectures that I've encountered in classifying language software. This includes some older mid-range style architectures and the latest handheld architectures. If preferred, I can circulate it to the list. If there are others interested in working on this document, again I'm more than happy to collaborate.

Baden

From baden at COMPULING.NET Mon Sep 23 02:06:04 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 12:06:04 +1000
Subject: proposed revision of format.sourcecode
Message-ID:

After a survey of several language archives, I'd like to propose some possible changes to the format.sourcecode schema. Essentially this is a list of programming languages of various types, in which software may be written. This list includes those found at:

   http://www.hypernews.org/HyperNews/get/computing/lang-list.html

A draft can be found online at:

   http://www.compuling.net/projects/olac/220902-draft-olac-format.sourcecode.xsd

Comments welcome.

Baden

From sb at UNAGI.CIS.UPENN.EDU Mon Sep 23 06:38:54 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Mon, 23 Sep 2002 02:38:54 EDT
Subject: proposed revision of format.sourcecode
In-Reply-To: Your mail dated Monday 23 September, 2002.
Message-ID:

Baden Hughes wrote:

> After a survey of several language archives, I'd like to propose some
> possible changes to the format.sourcecode schema. Essentially this is
> a list of programming languages of various types, in which software
> may be written. This list includes those found at:
> http://www.hypernews.org/HyperNews/get/computing/lang-list.html
>
> A draft can be found online at:
> http://www.compuling.net/projects/olac/220902-draft-olac-format.sourcecode.xsd
>
> Comments welcome.

This is great - a 20-fold increase over the number listed in my original 0.4 list. I grepped for a few obscure languages and they were all there.

I'd like to raise two low-level technical issues, capitalization and whitespace.

First, 99% of the codes are all-caps, even though some programming language names are not written like this (e.g. the list gives "PROLOG" but it should really be "Prolog"). However, rather than having to settle disputes about this question, I'd prefer it if we case-normalized everything. What do people think - should we standardize on uppercase?

Second, Baden's list includes many items with spaces, e.g. "OBJECTIVE CAML". However, it seems desirable to limit the range of characters that can appear in a controlled vocabulary item (e.g. no accents) so that there are no transmission problems, etc. In some contexts, such as hand-crafted CGI GET requests and HTML anchors, it is a pain to have to manually escape the space character. Could we live with a restriction of no spaces - i.e. replacing spaces with underscores?
** Note that neither of these issues is substantive, since each controlled vocabulary item will be associated with a human-readable form (including translations into other languages). For example, in Dublin Core, there is a refinement named "hasVersion" with the human-readable label "Has Version" [http://www.dublincore.org/documents/dcmes-qualifiers/]. The plan is to do the same thing for OLAC vocabularies.

-Steven
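To make the last point concrete, here is a sketch of how a case- and whitespace-normalized code might be paired with its human-readable labels. The markup here is an assumption about what an OLAC vocabulary document could look like, not an agreed format:

   <!-- Illustrative sketch only: assumed markup for code/label pairs. -->
   <term code="OBJECTIVE_CAML">
     <label xml:lang="en">Objective Caml</label>
   </term>
   <term code="PROLOG">
     <label xml:lang="en">Prolog</label>
   </term>

The all-caps, underscore-separated code is then purely internal; service providers would display the label (and could add labels in other languages), just as Dublin Core displays "Has Version" for hasVersion.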
From baden at COMPULING.NET Mon Sep 23 07:00:36 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 17:00:36 +1000
Subject: proposed revision of format.sourcecode
In-Reply-To: <200209230639.g8N6csL10762@unagi.cis.upenn.edu>
Message-ID:

I've updated the format.sourcecode schema draft with:

- unnecessary whitespace removed
- whitespace normalized to underscores in enumeration values
- typos corrected

You can find the updated list here:

   http://www.compuling.net/projects/olac/230902-draft-olac-format.sourcecode.xsd

There are currently 285 programming languages listed in this schema. If anyone has any more to add, drop me an email.

Regards

Baden

From ruyng at GATE.SINICA.EDU.TW Mon Sep 23 10:29:42 2002
From: ruyng at GATE.SINICA.EDU.TW (Ru-Yng Chang)
Date: Mon, 23 Sep 2002 06:29:42 -0400
Subject: proposed revision of format.sourcecode
Message-ID:

Dear all,

I compared the draft with the programming language codes in the National Central Library's standard for Chinese cataloguing, http://datas.ncl.edu.tw/catweb/2-1-2a.htm (Big-5 encoding), and found the following codes there that are missing from the draft:

---A-----------
ADAPTIVE SERVER ENTERPRISE
ADS-C
AL
ALPHARD
ANALITIK
ANNA
APL2
---B-----------
BCY/B
---C-----------
CADL
CALM
CANDE
CCL
CIP-L
CLIPPER
COLTS
COMSKEE
CONCURRENT_EUCLID
---D-----------
D.L.LOGO
DATAPLOT
DBL
DIST
DYNAMO
---E-----------
EDISON
ELAN
---F-----------
FOCUS
FRED
---G-----------
GHC
GLYPNIR
---H-----------
HYPERTALK
---I-----------
IDL
INFORMIX-4GL
INTERPRESS
ISETL
ISP
---J-----------
JAVA
JAVA_APPLET (INCLUDED IN JAVA)
JAVA_WORKSHOP (INCLUDED IN JAVA)
JOSEF
---K-----------
KHUWARIZMI
KYLIX
---L-----------
LISP
LOGLAN_82
LOGO
LOTUS_SCRIPT
LUCID
---M-----------
MACRO-11
MFC
MODULA-2
MOUSE
---N-----------
NATAL
NPL
---O-----------
OCCAM2
OPS5
---P-----------
PARAGON
PARLOG
PILOT
PLEASE
PL/1
PL/M51
PL/SQL
POP11
PORTAL
PSEUDOCODE
PUCMAT
---Q-----------
QEDIT
---R-----------
ROSS
---S-----------
S-ALGOL
SGML
SHELL
SIMNET
SMAL/80
SNAP
SNOBOL
SPECOL
SPITBOL
SQL/ORACLE
STAROFFICE
STEP_3
STEP_5
SURVIS
---T-----------
T
TIME_SERIES_PROCESSOR
TURBO
TUTOR
---U-----------
UCSD_PASCAL
UNIGRAPHICS
UNISON_AUTHOR_LANGUAGE

Ru-Yng

From baden at COMPULING.NET Mon Sep 23 13:28:23 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Mon, 23 Sep 2002 23:28:23 +1000
Subject: proposed revision of format.sourcecode
In-Reply-To:
Message-ID:

An updated version of the format.sourcecode schema is now available online, with additions from Ru-Yng Chang:

   http://www.compuling.net/projects/olac/240902-draft-olac-format.sourcecode.xsd

Regards

Baden

From gary.holton at UAF.EDU Tue Sep 24 14:07:00 2002
From: gary.holton at UAF.EDU (Gary Holton)
Date: Tue, 24 Sep 2002 10:07:00 -0400
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID:

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird wrote:

> So, what do you think? Do you agree with our proposals for
> (i) a syntactic simplification in our XML representation, and
> (ii) switching OLAC vocabularies from being centrally validated
> standards to recommendations? We would welcome your feedback.

Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is very positive.

Regarding (i), it is clear that a syntactic simplification is needed. I for one have never been able to keep straight refinements vs. schemes, and I don't think I'm alone here.

And regarding (ii), as you point out, the real issue should be not whether a particular refinement (and associated vocabulary) has been officially adopted (mandated?), but rather whether such a refinement is useful to the community. We can debate ontologies, but it is more difficult to debate usefulness without actually implementing a refinement. Your proposal would permit refinements ("extensions") to fit the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types of refinements and vocabularies would be useful to the end user. The difficulty is that we don't know enough about how the user will be searching, what they will be searching for, and what types of search facilities they will have.
The best we can do at this point is make an educated guess and then watch closely to see how the refinements and vocabularies are actually used. That said, I think we have some very good guesses already and will certainly be able to recommend best practices by December.

However, if we lock in the vocabularies then most archives will continue to have to support both an OLAC schema and a user-defined schema (as you point out). This would essentially remove the data provider from the loop, in that user-defined schemas would be viewed as idiosyncratic and non-standard. Allowing user-defined "extensions" would encourage innovation on the part of both data and service providers--innovation mediated by the end user.

Any reactions from others?

Gary Holton

From sb at UNAGI.CIS.UPENN.EDU Tue Sep 24 22:25:11 2002
From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird)
Date: Tue, 24 Sep 2002 18:25:11 EDT
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID:

Thanks for the positive feedback. While we await more reactions, let me jump in and say that Gary and I are working on a revised version of the proposal to bring it into line with new developments in the Dublin Core Metadata Initiative (DCMI). We'll preserve the new extensibility that people seem to appreciate, but also make syntactic changes to maximize interoperability with the wider digital libraries community.

In the past we've basically gone it alone in working out how to represent our own DC qualifications in XML. However, the timing of these recommendations and our forthcoming workshop present us with a new opportunity to standardize our implementation. If you'd like to learn more about what's happening in DCMI with qualifiers and XML, please see the following article and the material it cites:

   Recommendations for XML Schema for Qualified Dublin Core
   Proposal to DC Architecture Working Group
   http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/

Next week we'll circulate a proposal for how OLAC can conform to this. Note that this is only about XML implementation and not OLAC content. For those who only care about disseminating metadata, conformance with the DCMI recommendations will ensure maximal interoperability with the wider digital libraries community, so that your metadata pops up all over cyberspace.

Back on the subject of extensibility... The key innovation in our recent proposal, which we'd still like more feedback on, is for the OLAC vocabularies to be changed from centrally enforced standards to recommended practices. Under this model, any archive will be able to adopt and promulgate its favorite ontologies, while the OLAC Process is still used to identify community-agreed best practices that everyone should follow.

For instance, consider the sourcecode vocabulary, which is only relevant to the software archives and which may need constant updates. Under the proposed model, the vocabulary wouldn't actually need to reside on the OLAC site; it could live wherever it could be easily maintained. However, the OLAC site would host the details of any associated working group, so that others could discover the group and contribute to the revision of the vocabulary. It would also host any associated OLAC recommendation, so that everyone would know that the OLAC community had adopted a certain vocabulary as best practice.
-Steven

From jcgood at SOCRATES.BERKELEY.EDU Tue Sep 24 23:23:50 2002
From: jcgood at SOCRATES.BERKELEY.EDU (Jeff Good)
Date: Tue, 24 Sep 2002 16:23:50 -0700
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID:

Hello,

I wanted to say that I think the basic designs of the revisions proposed by Steven and Gary are very good suggestions. I completely agree with Gary Holton's points--so I won't repeat them. I thought I'd point out how I think these revisions can be usefully applied to some problems that the working group evaluating the linguistic types document has encountered. I think this new format will allow us to get past many issues which I had thought might be intractable. I guess I consider this to be a good "empirical" test of the proposal.

The specific problem was that there are many cross-cutting ways to classify the "type" of a linguistic document. There's a sense in which a document focuses on a big sub-field of linguistics like phonology, morphology, etc. There's the basic structure of a document: dictionary, grammar, text (the term "macrostructure" can be used to describe this category). And then there are important "meso/micro-structure" aspects of documents--like the type of transcription used (free translation, interlinear, etc.).

The original OLAC system encouraged us to create an ontology of document types which assumed that there was one "type" for a document, when, in reality, type is a multi-dimensional concept. As we realized this, we started to break down the types into the most important dimensions--like linguistic subject, basic structure, etc. But even then, there were problems of classification. For example, categories like "oratory", "narrative", "ludic" seemed appropriate for some linguistic documents--but it isn't immediately clear where they belong in a hierarchy of types (are they structural or content types? or are they something else?).

It was possible to create a system of types which works, but I think many of our conceptual and implementational problems can be more cleanly solved by the new system because of its extensibility. Specifically, rather than having to pigeonhole types into a few categories in a hierarchy, we can just propose a series of vocabularies corresponding to the potentially independent "type" parameters of a document--for example, a linguistic subject vocabulary, a document structural type vocabulary, and a "discourse"-type vocabulary for things like "oratory" and "narrative". (For more detail on this, there are relevant recent posts, one from me, on the Metadata list.)
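Under the proposed extension mechanism, the cross-cutting classification Jeff describes could look something like the sketch below, with one record carrying several independent type descriptors. The extension names and codes here are invented for illustration:

   <!-- Illustrative sketch only: invented extension names and codes. -->
   <subject extension="linguistic-field" code="PHONOLOGY"/>
   <type extension="structural-type" code="TEXT"/>
   <type extension="discourse-type" code="NARRATIVE"/>

Each dimension can then evolve as a vocabulary of its own, instead of competing for a place in a single hierarchy of types.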
Over time, I'm sure we'll find some of the vocabularies are more useful/used than others--and these can become OLAC recommended standard vocabularies. I think the real value of the new system will be that it is much more forgiving/flexible if we find we need to adapt our "type" categories in the future.

Since Steven just posted about the idea that vocabularies be recommended practices, I'll say that I think that aspect of the proposal is also very helpful to working out a linguistic type vocabulary. One thing that at least I am convinced of in the discussion of "types" is that there is a counterexample to every generalization you can make about them. It may be the case that some counterexamples are minor enough that we can get away without a good classification for them. Or it might be the case that a counterexample is revealing a set of important omissions in the proposals. It's hard to tell without testing a lot of archives. A recommended, but not enforced, vocabulary would address this problem--as archivers encounter situations that aren't covered, they wouldn't be forced to "fit" their document into a category where it doesn't belong. This would not only promote the creation of needed new vocabulary items but also maintain the integrity of existing ones.

Additionally, the idea of recommended vocabularies, plus a best practice standard, certainly is more in line with the general spirit of OLAC, and I think it would encourage more subcommunities to get involved and create the vocabularies which they need.

Jeff

From baden at COMPULING.NET Wed Sep 25 04:31:35 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Wed, 25 Sep 2002 14:31:35 +1000
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID:

> So, what do you think? Do you agree with our proposals for
> (i) a syntactic simplification in our XML representation, and

I personally agree with the syntactic revision. Backwards and future compatibility is a significant factor, and as such I believe the new revisions will make it easier to implement changes community-wide and will benefit archives which require special-purpose extensions.

> (ii) switching OLAC vocabularies from being centrally
> validated standards to recommendations? We would welcome
> your feedback.

The proposal for recommendations rather than mandated standards seems to draw partially on both the W3C and IETF processes, whereby drafts or notes are submitted, reviewed, implemented and then reviewed again with a view to standardisation if agreed to be best practice. This process scales very well, and yet allows individuals or institutions the freedom to innovate whilst encouraging best practice once peer review of implementations has taken place. I think this is important to encourage innovation amongst participating archives, who develop vocabularies to address their own needs first and then promote the benefits of these for wider community consideration.

Baden

From hdry at LINGUISTLIST.ORG Thu Sep 26 23:15:53 2002
From: hdry at LINGUISTLIST.ORG (Helen Dry)
Date: Thu, 26 Sep 2002 19:15:53 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu>
Message-ID:

Hi, Steven (and everyone),

Sorry to be so late responding to this proposal, but it's been a busy month. I am a little concerned about this proposal, perhaps because I don't understand exactly how the scheme system would work, so I thought I should make my comments and ask a few questions. Apologies if either or both are at a rather elementary level--I only seem to understand DC and XML for 10 minutes, right after I reread the websites. :-)

It seems to me that there are two separable proposals here: (1) collapsing the formal mechanisms of refinement and scheme into the extension mechanism, and (2) abandoning the attempt to reach general consensus on the descriptors that previously we were calling controlled vocabularies. The first may well be a welcome simplification, particularly administratively. (And I seem to have heard that it's the way DC is going anyway.)
The second seems worrisome to me for two primary reasons: (1) it seems counter to the overarching OLAC (and EMELD) goal of a unified--dare we say "standardized"?--mechanism for resource description and retrieval within the discipline; (2) on a practical level it may complicate--perhaps to a debilitating degree--the way that service providers implement search facilities.

Of course, I'm thinking about LINGUIST here--we aren't an archive, so the potential benefits of being able to DESCRIBE resources via any scheme we might devise are not salient to me. What I'm worried about is how we're going to offer a search engine that makes use of all these variant descriptions. Particularly for something like linguistic data types--which is probably the main search field linguists will want to use--this seems almost like a return to the bad old days of the free text field, with the consequent loss of ability to identify and retrieve relevant resources.

Now I imagine that there is some formal mechanism for relating schemes--I know you have a paragraph below about archives declaring the schemes they use in their Identify responses. But could you tell me exactly how this would work in practice? E.g., at the level of elements or terms? Would an archive that wants to use its own scheme have to provide a document showing how its categories relate to the categories in all the other schemes (e.g., that its "Seediq" was SIL's "Taroko")? Would the service provider have to construct a search engine that would first find and correlate all these documents, then search the multi-archive metadata for the resulting sets of terms? I'm sure it's possible--IF you could get everyone to provide scheme mappings--but it certainly seems unnecessarily complex... and, as I said, counter to the purpose of OLAC.

I thought we were trying to settle on a unified way to describe linguistic resources, in order to offer the discipline the benefits of a level of standardization. Though this will come at the admitted expense of a certain amount of detail and precision, I feel confident that it will be accepted (accepted for what it is) if we persevere. After all, DC isn't perfect, but people understand the utility of a restricted set of elements. It seems to me that, if the problem is that we may not come up with a proposal before December, we should either redouble our efforts and make the deadline or extend the deadline--not scrap the enterprise.

Actually, with regard to linguistic data types, I feel confident we can come up with a reasonable proposal before the deadline. And I think it's important that we do so, since this is really one of the most important vocabularies--probably the most important for a large part of our audience, i.e. academic linguists. It's the main way that people, as opposed to machines, will want to search the archives.

So, in sum, I agree with the arguments for using the extension mechanism and abandoning refinement and scheme. But I don't see the need to abandon the goal of reaching consensus on a single "OLAC-approved" set of linguistic data types, however that would be modeled in a world of "extensions" (not controlled vocabularies). Can we use extensions but not let in the world? BTW, under the proposal, will all the current refinements--e.g., "subject.language"--now become schemes?

But now I should stop and let someone knowledgeable explain to me exactly how this scheme system will work. I'm all ears... :-)

Ready for enlightenment...
-Helen
From hdry at LINGUISTLIST.ORG Thu Sep 26 23:36:44 2002
From: hdry at LINGUISTLIST.ORG (Helen Dry)
Date: Thu, 26 Sep 2002 19:36:44 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To:
Message-ID:

Hi, Gary (and everyone),

I've just sent a long posting to the list explaining some of my problems with Steven's & Gary's proposal, so all I want to do here is respond briefly. I completely agree with your point about the value of syntactic simplification.
But I'm not sure about the second point--reducing all OLAC vocabularies to recommendations. It's interesting where our opinions diverge--i.e., you see the benefits to the archive, which may already have a user-defined scheme, and I see the possible problems for the general service provider, which may not be able to handle multiple user-defined schemes in an efficient way. Perhaps OLAC can handle this problem by making STRONG recommendations... but in that case, I don't see the real difference between recommendations and a centrally validated standard... except for the fact that OLAC wouldn't have to re-publish all the metadata whenever a recommendation changed. I suppose this would be an administrative advantage--but enough of a one to lose the potential benefits of standardization???

I'm waiting to be convinced....

-Helen

From Gary_Simons at SIL.ORG Fri Sep 27 00:24:30 2002
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Thu, 26 Sep 2002 19:24:30 -0500
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID:

Helen,

You hit the nail on the head when you observe: "in that case, I don't see the real difference between recommendations and a centrally validated standard".
It was that same observation, but coming from the point of view of our status quo, that has been a key part of the motivation as Steven and I have been thinking about what our version 1.0 standard should look like.

In version 0.4 we have a centrally validated and mandated standard, but it has built-in optionality. For instance, it is our standard to use SIL and Linguist codes to identify languages precisely, but data providers also have the option of just providing free text. Thus the standard is currently not requiring language codes but only recommending them as best practice, and an examination of the harvested records from our 20 or so participating data providers reveals that many sites are not now using codes.

Our proposal to take the controlled vocabularies out of the standard and to treat them as best practice recommendations thus does not really change the current reality. In fact, it probably gives a better reflection of the reality. One key advantage from the point of view of managing the infrastructure is that it will not be necessary to change the standard when controlled vocabularies are changed or added. The metadata standard would just specify the structure of the container record and the mechanism for defining metadata extensions, and would be very static. Each controlled vocabulary would be managed separately, in an independent document and in a formal extension definition that would supply downloadable code sets, so that extension data can still be centrally validated.

When the community reaches a consensus that a particular vocabulary should be used when applicable, then it would become a community Recommendation and our default harvester would support it. Service providers would exploit it (as LINGUIST is now doing with searching by language), and that would show data providers who are not yet using the vocabulary the benefits of using it. We could even have a "recommended practice report card" that would show which recommended extensions an archive is using and which it is not.

Thus Steven and I are assuming that the end result of this change would not weaken compliance with standardized vocabularies (which is already optional), but that it would make it much easier to manage changes to vocabularies and to experiment with specialized vocabularies.

I hope that helps to clarify where we are coming from.

-Gary Simons
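To make the validation story concrete, here is a sketch of how a data provider might declare its formally defined extensions in the description section of its OAI Identify response. The container elements and URLs below are invented for illustration:

   <!-- Illustrative sketch only: an assumed declaration format. -->
   <description>
     <olac-extensions>
       <extension name="language"
                  definition="http://www.example.org/extensions/language.xml"/>
       <extension name="sourcecode"
                  definition="http://www.example.org/extensions/sourcecode.xml"/>
     </olac-extensions>
   </description>

A harvester could then fetch each definition, cache its code set, and validate the provider's records against it, which is how central validation can survive the move from mandated standard to recommendation.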
Helen Dry, 09/26/02 06:36 PM
Sent by: OLAC Implementers List
Subject: Re: A simpler format for OLAC vocabularies and schemes
Please respond to Open Language Archives Community Implementers List

Hi, Gary (and everyone),

I've just sent a long posting to the list explaining some of my problems with Steven's & Gary's proposal, so all I want to do here is respond briefly. I completely agree with your point about the value of syntactic simplification. But I'm not sure about the second point--reducing all OLAC vocabularies to recommendations. It's interesting where our opinions diverge--i.e., you see the benefits to the archive, which may already have a user-defined scheme, and I see the possible problems for the general service provider, which may not be able to handle multiple user-defined schemes in an efficient way. Perhaps OLAC can handle this problem by making STRONG recommendations . . . but in that case, I don't see the real difference between recommendations and a centrally validated standard . . . except for the fact that OLAC wouldn't have to re-publish all the metadata whenever a recommendation changed. I suppose this would be an administrative advantage--but is it enough of one to lose the potential benefits of standardization??? I'm waiting to be convinced....

-Helen

On 24 Sep 2002 at 10:07, Gary Holton wrote:

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird wrote:
>--
>
>So, what do you think? Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations? We would welcome your feedback.

Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is very positive. Regarding (i), it is clear that a syntactic simplification is needed. I for one have never been able to keep straight refinements vs. schemes, and I don't think I'm alone here. And as you point out in (ii), the real issue should not be whether a particular refinement (and associated vocabulary) has been officially adopted (mandated?), but rather whether such a refinement is useful to the community. We can debate ontologies, but it is more difficult to debate usefulness without actually implementing a refinement. Your proposal would permit refinements ("extensions") to fit the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types of refinements and vocabularies would be useful to the end user. The difficulty is that we don't know enough about how the user will be searching, what they will be searching for, and what types of search facilities they will have. The best we can do at this point is make an educated guess and then watch closely to see how the refinements and vocabularies are actually used. That said, I think we have some very good guesses already and will certainly be able to recommend best practices by December. However, if we lock in the vocabularies then most archives will continue to have to support both an OLAC schema and a user-defined schema (as you point out). This would essentially remove the data provider from the loop, in that user-defined schemas would be viewed as idiosyncratic and non-standard. Allowing user-defined "extensions" would encourage innovation on the part of both data and service providers--innovation mediated by the end user.

Any reactions from others?

Gary Holton

From hdry at LINGUISTLIST.ORG Fri Sep 27 16:46:45 2002
From: hdry at LINGUISTLIST.ORG (Helen Aristar Dry)
Date: Fri, 27 Sep 2002 12:46:45 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To:
Message-ID:

Hi, Gary,

Yes, I take your point that we can't force compliance; and, in general, I'd be all for letting standards evolve from usage. But actually, from the point of view of the LINGUIST service provider, the languages example isn't a heartening one. What our programmer had to do to search harvested OLAC metadata by subject language is write a special program that translates any text entry in the subject language field into the SIL code. This is possible to do with languages only because we have the Ethnologue name and alternate name tables on the site, and therefore we have a list of almost all the language names that any site might be using. It's still a lot of work, and we're no doubt missing or misclassifying the subject languages of a lot of records. Nevertheless, we do have a search engine that is using Ethnologue codes to identify resources by subject.language, thereby demonstrating the utility of this recommendation.
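The translation program Helen describes amounts to a large name-to-code lookup. Here is a minimal sketch of the idea, assuming hypothetical tab-separated table files derived from the Ethnologue name and alternate-name tables; the actual LINGUIST implementation is not shown in this thread and may well differ.

    # Sketch only: normalize free-text subject.language entries to SIL
    # codes via name tables. The file names and the "code<TAB>name" row
    # format are assumptions made for illustration.
    def load_name_table(path):
        """Read 'code<TAB>name' rows into a case-folded name -> code map."""
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                code, name = line.rstrip("\n").split("\t", 1)
                table[name.strip().lower()] = code
        return table

    def subject_language_code(free_text, names, alternate_names):
        """Translate a free-text language field into an SIL code, if known."""
        key = free_text.strip().lower()
        return names.get(key) or alternate_names.get(key)

    # Hypothetical usage:
    #   names = load_name_table("ethnologue_names.tsv")
    #   alternates = load_name_table("ethnologue_alternate_names.tsv")
    #   subject_language_code("  Some Language Name  ", names, alternates)

As Helen notes next, this approach only works because the Ethnologue tables cover almost every name an archive might use; no comparable reference exists for the other vocabularies.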
But what are we going to do for linguistic data type and all the other erstwhile controlled vocabularies?? There's no "alternate name" reference for extensions (at least not as far as I know) that we could use to write a translation program . . . even if it were feasible to translate every relevant value in every metadata record. And it makes no sense to set up search facilities that use the recommended vocabulary if there's no data classified by it--getting a lot of "not found" messages will discourage users from using the recommended vocabulary, not encourage it. So our search engine is not going to be any help in promulgating these recommendations. Sigh.

I realize that mandating a controlled vocabulary wouldn't ensure that archives used it. Perhaps it would give them a little more impetus, however. And it would certainly be nice if each archive would "translate" its user-defined metadata into the recommended OLAC vocabulary, rather than leaving the service provider to figure out how to do it for multiple archives, each with its own idiosyncratic and undocumented set of extensions. I'm still hoping that you and Steven will come up with some bright ideas about how to help/encourage/convince archives to do this . . .

Sorry to be negative. You know I think OLAC is the best thing since sliced bread. . . . I'm just having some trouble figuring out how we're going to cope with the new-fangled slices....

All the best,
-Helen

Date sent: Thu, 26 Sep 2002 19:24:30 -0500
Send reply to: Open Language Archives Community Implementers List
From: Gary Simons
Subject: Re: A simpler format for OLAC vocabularies and schemes
To: OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG

> Helen,
>
> You hit the nail on the head when you observe: "in that case, I don't see the real difference between recommendations and a centrally validated standard". It was that same observation, but coming from the point of view of our status quo, that has been a key part of the motivation as Steven and I have been thinking about what our version 1.0 standard should look like.
>
> In version 0.4 we have a centrally validated and mandated standard, but it has built-in optionality. For instance, it is our standard to use SIL and Linguist codes to identify languages precisely, but data providers also have the option of just providing free text. Thus the standard is currently not requiring language codes but only recommending them as best practice, and an examination of the harvested records from our 20 or so participating data providers reveals that many sites are not now using codes.
>
> Our proposal to take the controlled vocabularies out of the standard and to treat them as best practice recommendations thus does not really change the current reality. In fact, it probably gives a better reflection of the reality. One key advantage from the point of view of managing the infrastructure is that it will not be necessary to change the standard when controlled vocabularies are changed or added. The metadata standard would just specify the structure of the container record and the mechanism for defining metadata extensions, and would be very static. Each controlled vocabulary would be managed separately, in an independent document and in a formal extension definition that would supply downloadable code sets so that extension data can still be centrally validated.
> When the community reaches a consensus that a particular vocabulary should be used when applicable, then it would become a community Recommendation and our default harvester would support it. Service providers would exploit it (as Linguist is now doing with searching by language), and that would show data providers who are not yet using the vocabulary the benefits of using it. We could even have a "Recommended practice report card" that would show which recommended extensions an archive is using and which it is not.
>
> Thus Steven and I are assuming that the end result of this change would not weaken compliance with standardized vocabularies (which is already optional), but that it would make it much easier to manage changes to vocabularies and to experiment with specialized vocabularies.
>
> I hope that helps to clarify where we are coming from.
>
> -Gary Simons
>
> Helen Dry, 09/26/02 06:36 PM
> Sent by: OLAC Implementers List
> Subject: Re: A simpler format for OLAC vocabularies and schemes
> Please respond to Open Language Archives Community Implementers List
>
> Hi, Gary (and everyone),
>
> I've just sent a long posting to the list explaining some of my problems with Steven's & Gary's proposal, so all I want to do here is respond briefly. I completely agree with your point about the value of syntactic simplification. But I'm not sure about the second point--reducing all OLAC vocabularies to recommendations. It's interesting where our opinions diverge--i.e., you see the benefits to the archive, which may already have a user-defined scheme, and I see the possible problems for the general service provider, which may not be able to handle multiple user-defined schemes in an efficient way. Perhaps OLAC can handle this problem by making STRONG recommendations . . . but in that case, I don't see the real difference between recommendations and a centrally validated standard . . . except for the fact that OLAC wouldn't have to re-publish all the metadata whenever a recommendation changed. I suppose this would be an administrative advantage--but is it enough of one to lose the potential benefits of standardization??? I'm waiting to be convinced....
>
> -Helen
>
> On 24 Sep 2002 at 10:07, Gary Holton wrote:
>
> On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird wrote:
> >--
> >
> >So, what do you think? Do you agree with our proposals for
> >(i) a syntactic simplification in our XML representation, and
> >(ii) switching OLAC vocabularies from being centrally validated
> >standards to recommendations? We would welcome your feedback.
>
> Dear Steven & Gary,
>
> I haven't had much time to digest your proposal, but my initial reaction is very positive. Regarding (i), it is clear that a syntactic simplification is needed. I for one have never been able to keep straight refinements vs. schemes, and I don't think I'm alone here. And as you point out in (ii), the real issue should not be whether a particular refinement (and associated vocabulary) has been officially adopted (mandated?), but rather whether such a refinement is useful to the community. We can debate ontologies, but it is more difficult to debate usefulness without actually implementing a refinement. Your proposal would permit refinements ("extensions") to fit the needs of the community, so that useful solutions could evolve.
> I have often approached the metadata issue by trying to imagine what types of refinements and vocabularies would be useful to the end user. The difficulty is that we don't know enough about how the user will be searching, what they will be searching for, and what types of search facilities they will have. The best we can do at this point is make an educated guess and then watch closely to see how the refinements and vocabularies are actually used. That said, I think we have some very good guesses already and will certainly be able to recommend best practices by December. However, if we lock in the vocabularies then most archives will continue to have to support both an OLAC schema and a user-defined schema (as you point out). This would essentially remove the data provider from the loop, in that user-defined schemas would be viewed as idiosyncratic and non-standard. Allowing user-defined "extensions" would encourage innovation on the part of both data and service providers--innovation mediated by the end user.
>
> Any reactions from others?
>
> Gary Holton
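Gary Simons' "Recommended practice report card", quoted above, is also easy to picture in code. Here is a minimal sketch, assuming only that an archive's declared extensions (say, from the description section of its OAI Identify response) and the current community Recommendations are available as sets of extension names; the names below are invented for illustration, not an implemented OLAC service.

    # Sketch only: compare an archive's declared extensions against the
    # community Recommendations, as in the proposed "report card".
    def report_card(declared_extensions, recommended_extensions):
        """Report which recommended extensions an archive uses, which it
        does not, and which extensions are archive-specific."""
        declared = set(declared_extensions)
        recommended = set(recommended_extensions)
        return {
            "using": sorted(recommended & declared),
            "not_using": sorted(recommended - declared),
            "archive_specific": sorted(declared - recommended),
        }

    # Hypothetical usage:
    #   report_card({"language", "sourcecode"}, {"language", "linguistic-type"})
    #   -> {'using': ['language'], 'not_using': ['linguistic-type'],
    #       'archive_specific': ['sourcecode']}

Such a report rewards cooperation without mandating it, which is the crux of the recommendations-versus-standards question in this thread.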
Baden From sb at UNAGI.CIS.UPENN.EDU Mon Sep 16 21:39:54 2002 From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird) Date: Mon, 16 Sep 2002 17:39:54 EDT Subject: A simpler format for OLAC vocabularies and schemes Message-ID: The OLAC metadata format provides two mechanisms for community- specific resource description. First, special refinements (metadata elements and corresponding vocabularies) support compatible description across the community. For example, the subject.language element, and the OLAC-Language vocabulary, permit all archives to identify subject language in the same manner. Second, every OLAC element permits an optional scheme attribute for use by sub-communities of OLAC. For example, the scholars at Academia Sinica can use their own naming scheme for Formosan languages and still package it up using the OLAC metadata container. This combination of standard refinements and user-defined schemes seems to offer a reasonable balance between interoperability and extensibility. Over the past month, Gary and I have been reviewing the design of OLAC metadata and have concluded that these parallel mechanisms are unnecessary. We think that with a *single* extension mechanism, OLAC can provide even better interoperability and extensibility. Moreover, we think this can be done with less administrative and technical infrastructure than before, making it still easier for archives to participate in OLAC. A. THE PRESENT SITUATION We begin with a quick review of how the two existing mechanisms work in OLAC metadata. First, community-specific refinements are represented using Dublin Core qualifications represented in XML. Here is an example for subject language: A resource about the Sikaiana language: This refinement permits focussed searching and better precision/recall than the corresponding Dublin Core element: The Sikaiana Language The OLAC version is flexible in that the code attribute is optional and that free-text can be put in the element content. The second mechanism is for user-defined schemes. All OLAC elements permit a scheme attribute, naming some third-party format or vocabulary that one or more OLAC archives use. For instance, the language listed by Ethnologue as Taroko (TRV) is known as Seediq in Academia Sinica, and OLAC would permit either or both of the following elements to appear in a metadata record for this language: Seediq Such a resource would be discovered under either naming scheme, and Academia Sinica could provide end-user services that rewarded any archive which employed its scheme for Formosan language identification. B. PROBLEMS WITH THE PRESENT SITUATION There are four general problems with the present situation. 1. Finalizing standard refinements. Our track record at developing controlled vocabularies over the past year indicates that we are not going to be able to finalize all the vocabularies that the OLAC metadata standard specifies in time for launching version 1.0 after our December workshop. Even if some vocabularies are finalized by December, the discussion may be reopened any time a new kind of archive joins OLAC. However, each vocabulary revision must currently be released as a new version of the entire OLAC metadata set, an unacceptable bureaucratic obstacle. 2. The artificial distinction between refinements and schemes. It is not clear when a putative refinement is important enough to be adopted as an OLAC standard, versus a user-defined scheme. Some of the refinements we recognize at present aren't as germane to the overall enterprise as others (e.g. 
operating system vs subject language), and may not have enough support to be retained. Conversely, the community is sure to develop new, useful ontologies that we don't support at present, and we would need to change the OLAC metadata standard in order to accommodate them. Promoting a user-defined scheme to an OLAC standard would necessitate a change in the XML representation, generating unnecessary work for all archives that support the scheme. 3. Duplication of technical support. User-defined schemes are likely to involve controlled vocabularies, with the same needs as OLAC vocabularies with respect to validation, translation to human-readable form in service providers, and dumb-down to Dublin Core for OAI interoperability. At present, the necessary infrastructure must be created twice over, once for each of the two mechanisms. 4. Idiosyncracies of XML schema. XML schema is used to define the well-formedness of OLAC records, but it is unable to express co-occurrence constraints between attribute values. This means that we cannot have more than one vocabulary for an element, forcing us to build structure into element names and multiply the names (e.g. Format.markup, Format.cpu, Format.os, ...). It is unfortunate that such a fundamental aspect of the OLAC XML format depends on a shortcoming of a tool that we may not be using for very long. In sum, the current model will be difficult to manage over the long term. Administratively, it encourages us to seek premature closure on issues of content description that can never be closed. Technically, it forces us to release new versions of the metadata format with each vocabulary revision, and forces us to create software infrastructure to support a mishmash of four syntactic extensions of DC: C. A NEW APPROACH In response to the problems outlined above, we would like to propose a new approach. The basic idea is simple: express all refinements, vocabularies and schemes using a uniform DC extension mechanism, and treat them all as recommendations instead of centrally-validated standards. The extension mechanism requires two attributes, called "extension" and "code", as shown below: It would be syntactically valid to simply use an extension in metadata without defining it. However, for extensions that will be used across the community, there must also be a formal definition that enumerates the corresponding controlled vocabulary in such a way that data providers and service providers alike can harvest the vocabulary from its definitive source. Thus another aspect of the new approach is an XML schema for the formal definition of an XDC extension. In the description section of the OAI Identify response, a data provider would declare which formally defined extensions it employs in its metadata. Extensions that enjoyed broad community support would be identified as OLAC Recommendations (following the existing OLAC Process). All OLAC archives would be encouraged to adopt them, in the sense that OLAC service providers would permit end-users to perform focussed searches over these extensions. In this way, archives that cooperate with the rest of the community are rewarded. Note that the approach isn't specific to language archives, so we're calling it extensible Dublin Core (XDC). An example of the syntax is available (an XML DTD, the equivalent XML schema, and an instance document): http://www.language-archives.org/XDC/0.1/ D. BENEFITS The new approach is technically simpler than the existing approach, and neatly solves the four problems we reported. 1. 
Finalizing standard refinements. The editors of OLAC vocabulary documents would be empowered to edit the vocabulary into the future, without concern for integration with new releases of the OLAC metadata format. 2. The artificial distinction between refinements and schemes. The syntactic distinction is gone, being replaced by a semantic one: is the vocabulary an OLAC Recommendation or not? Any archive or group of archives would be free to start using their own extensions without any formal registration. They could build a service to demonstrate the merit of their extension, thereby encouraging other archives to adopt it. Once broad support had been established, they could build a case for an OLAC Recommendation, leading to adoption across the community. 3. Duplication of technical support. With the single extension mechanism, we can provide uniform technical support for validation, translation and dumb-down. 4. Idiosyncracies of XML schema. We no longer give XML schema such sway in determining our XML syntax. Other XML and database technologies will be used to test that an extension is used correctly. In sum, the new approach is extensible, requiring no central administration of extensions, and no coordination of vocabulary revisions with new releases of the metadata format. The new approach also supports interoperability across the whole OLAC community (via OLAC Recommendations) and also among OLAC sub-communities that want to create their own special-purpose extensions. E. IMPLICATIONS We are still working out the technical implications for OLAC central services (e.g. registration, Vida, ORE, etc), and we will only be able to implement parts of this in time for the December meeting. As always, we would welcome donations of programmer time to help us. The short-term implication for OLAC archives is completely trivial, since only a simple syntactic change is required. The most important implication of this change is that it reduces the pressure to reach final agreement on OLAC vocabularies by our December workshop. But this isn't an excuse for us to slow down on that front. On the contrary, it frees us up to find working solutions for the key vocabularies that define us as a community. These will always be imperfect compromises that we can agree to work with and revise as necessary, well into the future. In sum, we hope we are not opening up a technical can of worms, but facilitating progress on the substantive issues, our common descriptive ontologies. Therefore, we encourage people to identify a particular extension that they would like to work on, and post their ideas and questions to this list (as Baden Hughes has just now done for sourcecode). You may also like to present your ideas at our workshop in December... -- So, what do you think? Do you agree with our proposals for (i) a syntactic simplification in our XML representation, and (ii) switching OLAC vocabularies from being centrally validated standards to recommendations? We would welcome your feedback. Steven Bird & Gary Simons From sb at UNAGI.CIS.UPENN.EDU Mon Sep 16 22:13:15 2002 From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird) Date: Mon, 16 Sep 2002 18:13:15 EDT Subject: query about format.sourcecode In-Reply-To: Your mail dated Monday 16 September, 2002. Message-ID: Baden Hughes wrote: > I've got a query about matters related to the element format.sourcecode Its good to see discussion of software resources for a change, and I hope the maintainers of software archives (DFKI, TRACTOR) will contribute to this discussion. 
> Currently the spec at http://www.language-archives.org/OLAC/olacms.html > assumes that software resources indexed by OLAC will be in source code > (and hence appropriate entries will be made under this tagset). Not quite - all OLAC elements are optional, and some elements are simply inappropriate for some resources. Software distributed in binary form only doesn't need to be given any sourcecode descriptor. > The recommendation is currently: > > code="PROGRAMMING_LANGUAGE">Comments > > There are several questions I have about this. > > 1) Do we need to clarify this even further as there are apparently two > distinct options from the archive contents I've been working with). One > is where the sourcecode requires compilation, the other is where > sourcecode is essentially a script (or series of scripts). Any > information about the "state" of the source code is likely to be > inconsistent at best across archives, and I suspect even within a single > archive. IMHO its relatively important to the end user of the OLAC > search engine as to what state the sourcecode is in (ie how applicable > is this code to the platforms I have access to). Good, so the end-user requirement here is to be able to answer the question: "Can I run this software?" > 2) In the case where software resources indexed by OLAC are distributed > in compiled form (ie not sourcecode) there's apparently not much more > room to encode this information either. Apart from not strictly being > something which belongs in a format.sourcecode element, the > recommendation I assume would be that you could standardise this again > by using the comment field, but the same consistency problem arises. > Again, IMHO its relatively important to the end user of the OLAC search > engine as to what state the sourcecode is in (ie can I just install and > run or is it more complex) Right, so the end-user requirement here is to be able to answer the question: "How much effort will be required to get this running?" > These two points may not represent large issues, but if the archives you > are dealing with have a lot of software which ranges from source scripts > in a range of languages, source for compilation for a range of > compilers, and compiled "ready to run" applications, the granularity of > this markup can be important and greatly assist with classification and > indexation of resources in an appropriate manner. Additionally, for the > less computer literate end users, this distinction is very important in > them effectively locating a resource which is appropriate to their > needs. Absolutely. Currently we have vocabularies for Sourcecode, CPU, and OS. However, we can modify of scrap them if they don't serve our needs for resource description and discovery. Perhaps we need a new vocabulary that better describes the state of the sourcecode. One way to proceed here is for Baden (and any others) to identify the full range of end-user requirements (is it more than these two?) then propose vocabularies that best serve these requirements... 
-Steven -- Steven.Bird at ldc.upenn.edu http://www.ldc.upenn.edu/sb Assoc Director, LDC; Adj Assoc Prof, CIS & Linguistics Linguistic Data Consortium, University of Pennsylvania 3600 Market St, Suite 810, Philadelphia, PA 19104-2653 From baden at COMPULING.NET Fri Sep 20 11:57:22 2002 From: baden at COMPULING.NET (Baden Hughes) Date: Fri, 20 Sep 2002 21:57:22 +1000 Subject: proposed revision of format.os Message-ID: In working with several archives and drawing on other IT experience, I'd like to make some proposed changes to the format.os schema. --- 1.0 OLAC Schema for operating system types, Steven Bird, 4/27/01 1.1 draft OLAC Schema for operating system types, Baden Hughes, 19/09/02 --- You can also find this draft schema at http://www.compuling.net/projects/olac/190902-draft-olac-format.os.xsd These changes essentially add to the list if possible operating systems that I've encountered in classifying software. If preferred, I can circulate to the list. If there's others interested in working on this document, I'm more than happy to collaborate. Baden From baden at COMPULING.NET Fri Sep 20 12:15:53 2002 From: baden at COMPULING.NET (Baden Hughes) Date: Fri, 20 Sep 2002 22:15:53 +1000 Subject: proposed revision of format.cpu Message-ID: In working with several archives and drawing on other IT experience, I'd like to make some proposed changes to the format.cpu schema, (without regurgitating the entire history of computing in the process :-). --- 1.0 OLAC Schema for CPUs, Steven Bird, 5/7/01 1.1 draft OLAC Schema for CPU, Baden Hughes, 19/09/02 --- You can also find this draft schema at http://www.compuling.net/projects/olac/190902-draft-olac-format.cpu.xsd These changes essentially add to the list if possible operating systems that I've encountered in classifying cpu architectures relevant to language software. This includes some older mid-range style architectures and the latest handheld architectures. If preferred, I can circulate to the list. If there's others interested in working on this document, again I'm more than happy to collaborate. Baden From baden at COMPULING.NET Mon Sep 23 02:06:04 2002 From: baden at COMPULING.NET (Baden Hughes) Date: Mon, 23 Sep 2002 12:06:04 +1000 Subject: fproposed revision of format.sourcecode Message-ID: After a survey of several language archives, I'd like to propose some possible changes to the format.sourceode schema. Essentially this list is a list of programming languages of various types, in which software may be written. This list includes those found at: http://www.hypernews.org/HyperNews/get/computing/lang-list.html A draft can be found online at: http://www.compuling.net/projects/olac/220902-draft-olac-format.sourceco de.xsd Comments welcome. Baden From sb at UNAGI.CIS.UPENN.EDU Mon Sep 23 06:38:54 2002 From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird) Date: Mon, 23 Sep 2002 02:38:54 EDT Subject: fproposed revision of format.sourcecode In-Reply-To: Your mail dated Monday 23 September, 2002. Message-ID: Baden Hughes wrote: > After a survey of several language archives, I'd like to propose some > possible changes to the format.sourceode schema. Essentially this list > is a list of programming languages of various types, in which software > may be written. This list includes those found at: > http://www.hypernews.org/HyperNews/get/computing/lang-list.html > > A draft can be found online at: > http://www.compuling.net/projects/olac/220902-draft-olac-format.sourcecode.xsd > > Comments welcome. 
This is great - a 20-fold increase on the number listed in my original 0.4 list. I grepped for a few obscure languages and they were all there. I'd like to raise two low-level technical issues, capitalization and whitespace. First, 99% of the codes are all-caps, even though some programming language names are not written like this (e.g. the list gives "PROLOG" but it should really be "Prolog"). However, rather than having to settle disputes about this question, I'd prefer it if we case-normalized everything. What do people think - should we standardize on uppercase? Second, Baden's list includes many items with spaces, e.g. "OBJECTIVE CAML". However, it seems desirable to limit the range of characters that can appear in a controlled vocabulary item (e.g. no accents) so that there is no transmission problems etc. In some contexts, such as hand-crafted CGI Get requests and HTML anchors, it is a pain to have to manually escape the space character. Could we live with a restriction of no spaces - i.e. replacing spaces with underscore? ** Note that neither of these issues is substantive, since each controlled vocabulary item will be associated with a human readable form (including translations into other languages). For example, in Dublin Core, there is a refinement named "hasVersion" with the human-readable label "Has Version". [http://www.dublincore.org/documents/dcmes-qualifiers/]. The plan is to do the same thing for OLAC vocabularies. -Steven From baden at COMPULING.NET Mon Sep 23 07:00:36 2002 From: baden at COMPULING.NET (Baden Hughes) Date: Mon, 23 Sep 2002 17:00:36 +1000 Subject: fproposed revision of format.sourcecode In-Reply-To: <200209230639.g8N6csL10762@unagi.cis.upenn.edu> Message-ID: I've updated the format.sourcecode schema draft with: -unnecessary whitespace removed -whitespace normalized to underscores in enumeration values -typos corrected You can find the updated list here: http://www.compuling.net/projects/olac/230902-draft-olac-format.sourceco de.xsd There's currently 285 programming languages listed on this schema. If any one has any more to add, drop me an email. Regards Baden > -----Original Message----- > From: Steven Bird [mailto:sb at unagi.cis.upenn.edu] > Sent: Monday, 23 September 2002 16:39 > To: baden at compuling.net > Cc: OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG > Subject: Re: fproposed revision of format.sourcecode > > > > Baden Hughes wrote: > > After a survey of several language archives, I'd like to > propose some > > possible changes to the format.sourceode schema. > Essentially this list > > is a list of programming languages of various types, in > which software > > may be written. This list includes those found at: > > http://www.hypernews.org/HyperNews/get/computing/lang-list.html > > > > A draft can be found online at: > > > http://www.compuling.net/projects/olac/220902-> draft-olac-format.source > > code.xsd > > > > Comments welcome. > > This is great - a 20-fold increase on the number listed in my > original 0.4 list. I grepped for a few obscure languages and > they were all there. > > I'd like to raise two low-level technical issues, > capitalization and whitespace. > > First, 99% of the codes are all-caps, even though some > programming language names are not written like this (e.g. > the list gives "PROLOG" but it should really be "Prolog"). > However, rather than having to settle disputes about this > question, I'd prefer it if we case-normalized everything. > What do people think - should we standardize on uppercase? 
> > Second, Baden's list includes many items with spaces, e.g. > "OBJECTIVE CAML". However, it seems desirable to limit the > range of characters that can appear in a controlled > vocabulary item (e.g. no accents) so that there is no > transmission problems etc. In some contexts, such as > hand-crafted CGI Get requests and HTML anchors, it is a pain > to have to manually escape the space character. Could we > live with a restriction of no spaces - i.e. replacing spaces > with underscore? > > ** Note that neither of these issues is substantive, since > each controlled vocabulary item will be associated with a > human readable form (including translations into other > languages). For example, in Dublin Core, there is a > refinement named "hasVersion" with the human-readable label > "Has Version". > [http://www.dublincore.org/documents/dcmes-> qualifiers/]. > The > plan is to do the same thing for OLAC vocabularies. > > -Steven > From ruyng at GATE.SINICA.EDU.TW Mon Sep 23 10:29:42 2002 From: ruyng at GATE.SINICA.EDU.TW (Ru-Yng Chang) Date: Mon, 23 Sep 2002 06:29:42 -0400 Subject: fproposed revision of format.sourcecode Message-ID: Dear all, I find the difference between the draft and the code for program language of the standard of Chinese catalogue from National Central Library. http://datas.ncl.edu.tw/catweb/2-1-2a.htm(Big-5 encoding.) As the list. ---A----------- ADAPTIVE SERVER ENTERPRISE ADS-C AL ALPHARD ANALITIK ANNA APL2 ---B----------- BCY/B ---C----------- CADL CALM CANDE CCL CIP-L CLIPPER COLTS COMSKEE CONCURRENT_EUCLID ---D----------- D.L.LOGO DATAPLOT DBL DIST DYNAMO ---E----------- EDISON ELAN ---F----------- FOCUS FRED ---G----------- GHC GLYPNIR ---H----------- HYPERTALK ---I----------- IDL INFORMIX-4GL INTERPRESS ISETL ISP ---J----------- JAVA JAVA_APPLET (INCLUED IN JAVA) JAVA_WORKSHOP (INCLUED IN JAVA) JOSEF ---K----------- KHUWARIZMI KYLIX ---L----------- LISP LOGLAN_82 LOGO LOTUS_SCRIPT LUCID ---M----------- MACRO-11 MFC MODULA-2 MOUSE ---M----------- NATAL NPL ---O----------- OCCAM2 OPS5 ---P----------- PARAGON PARLOG PILOT PLEASE PL/1 PL/M51 PL/SQL POP11 PORTAL PSEUDOCODE PUCMAT ---Q----------- QEDIT ---R----------- ROSS ---S----------- S-ALGOL SGML SHELL SIMNET SMAL/80 SNAP SNOBOL SPECOL SPITBOL SQL/ORACLE STAROFFICE STEP_3 STEP_5 SURVIS ---T----------- T TIME_SERIES_PROCESSOR TURBO TUTOR ---U----------- UCSD_PASCAL UNIGRAPHICS UNISON_AUTHOR_LANGUAGE Ru-Yng From baden at COMPULING.NET Mon Sep 23 13:28:23 2002 From: baden at COMPULING.NET (Baden Hughes) Date: Mon, 23 Sep 2002 23:28:23 +1000 Subject: proposed revision of format.sourcecode In-Reply-To: Message-ID: An updated version of the format.sourcecode schema is now available online with additions from Ru-Yng Chang. http://www.compuling.net/projects/olac/240902-draft-olac-format.sourceco de.xsd Regards Baden From gary.holton at UAF.EDU Tue Sep 24 14:07:00 2002 From: gary.holton at UAF.EDU (Gary Holton) Date: Tue, 24 Sep 2002 10:07:00 -0400 Subject: A simpler format for OLAC vocabularies and schemes Message-ID: On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird wrote: >-- > >So, what do you think? Do you agree with our proposals for >(i) a syntactic simplification in our XML representation, and >(ii) switching OLAC vocabularies from being centrally validated >standards to recommendations? We would welcome your feedback. > Dear Steven & Gary, I haven't had much time to digest your proposal, but my initial reaction is very positive. Regarding (i), it is clear that a syntactic simplification is needed. 
I for one have never been able to keep straight refinements vs. schemes, and I don't think I'm alone here. And as you point out (ii), the real issue should be not whether a particular refinement (and associated vocabulary) has been officially adopted (mandated?), but rather whether a such a refinement is useful to the community. We can debate ontologies, but it is more difficult to debate usefulness without actually implementing a refinement. Your proposal would permit refinements ("extensions") to fit the needs of the community, so that useful solutions could evolve. I have often approached the metadata issue by trying to imagine what types of refinements and vocabularies would be useful to the end user. The difficulty is that we don't know enough about how the user will be searching, what they will be searching for, and what types of search facilities they will have. The best we can do at this point is make an educated guess and then watch closely to see how the refinements and vocabularies are actually used. That said, I think we have some very good guesses already and will certainly be able to recommend best practices by December. However, if we lock in the vocabularies then most archives will continue to have to support both an OLAC schema and a user-defined schema (as you point out). This would essentially remove the data provider from the loop, in that user-defined schemas would be viewed as idiosyncratic and non-standard. Allowing user-defined "extensions" would encourage innovation on the part of both data and service providers--innovation mediated by the end user. Any reactions from others? Gary Holton From sb at UNAGI.CIS.UPENN.EDU Tue Sep 24 22:25:11 2002 From: sb at UNAGI.CIS.UPENN.EDU (Steven Bird) Date: Tue, 24 Sep 2002 18:25:11 EDT Subject: A simpler format for OLAC vocabularies and schemes Message-ID: Thanks for the positive feedback. While we await more reactions let me jump in and say that Gary and I are working on a revised version of the proposal to bring it into line with new developments in the Dublin Core Metadata Initiative (DCMI). We'll preserve the new extensibility that people seem to appreciate, but also make syntactic changes to maximize interoperability with the wider digital libraries community. In the past we've basically gone it alone in working out how to represent our own DC qualifications in XML. However, the timing of these recommendations and our forthcoming workshop present us with a new opportunity to standardize our implementation. If you'd like to learn more about what's happening in DCMI with qualifiers and XML, please see the following article and the material it cites: Recommendations for XML Schema for Qualified Dublin Core Proposal to DC Architecture Working Group http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/ Next week we'll circulate a proposal for how OLAC can conform with this. Note that this is only about XML implementation and not OLAC content. For those who only care about disseminating metadata, conformance with the DCMI recommendations will ensure maximal interoperability with the wider digital libraries community, so that your metadata pops up all over cyberspace. Back on the subject of extensibility... The key innovation in our recent proposal, that we'd still like more feedback on, is for the OLAC vocabularies to be changed from being centrally enforced standards to recommended practices. 
Under this model, any archive will be able to adopt and promulgate its favorite ontologies, while the OLAC Process is still used to identify community-agreed best practices that everyone should follow. For instance, consider the sourcecode vocabulary, which is only relevant to the software archives and which may need constant updates. Under the proposed model, the vocabulary wouldn't actually need to reside on the OLAC site; it could live wherever it could be easily maintained. However, the OLAC site would host the details of any associated working group, so that others could discover the group and contribute to the revision of the vocabulary. It would also host any associated OLAC recommendation, so that everyone would know that the OLAC community had adopted a certain vocabulary as best practice. -Steven From jcgood at SOCRATES.BERKELEY.EDU Tue Sep 24 23:23:50 2002 From: jcgood at SOCRATES.BERKELEY.EDU (Jeff Good) Date: Tue, 24 Sep 2002 16:23:50 -0700 Subject: A simpler format for OLAC vocabularies and schemes In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu> Message-ID: Hello, I wanted to say that I think the basic designs of the revisions proposed by Steven and Gary are very good suggestions. I completely agree with Gary Holton's points--so I won't repeat them. I thought I'd point out how I think these revisions can be usefully applied to some problems that the working group evaluating the linguistic types document. I think this new format will allow us to get past many issues which I thought may have been intractable. I guess I consider this to be a good "empirical" test of the proposal. The specific problem was that there are many cross-cutting ways to classify the "type" of a linguistic document. There's a sense in which a document focuses on a big sub-field of linguistics like phonology, morphology, etc. There's the basic structure of a document: dictionary, grammar, text (the term "macrostructure" can be used to describe this category). And then there are important "meso/micro-structure" aspects of documents---like the type of transcription used (free translation, interlinear, etc.) The original OLAC system encouraged us to create an ontology of document types which assumed that there was one "type" for a document, when, in reality, type is a multi-dimensional concept. As we realized this, we started to break down the types into the most important dimensions--like linguistic subject, basic structure, etc. But even then, there were problems of classification. For example, categories like "oratory", "narrative", "ludic" seemed appropriate for some linguistic documents--but it isn't immediately clear where they belong in a hierarchy of types (are they structural or content types? or are they something else?). It was possible to create a system of types which works, but I think many of our conceptual and implementational problems can be more cleanly solved by the new systems because of it extensibility. Specifically, rather than having to pigeonhole types into a few categories in a hierarchy, we can just propose a series of vocabularies corresponding to the potentially independent "type" parameters of a document--for example, a linguistic subject vocabulary, a document structural type vocabulary, a "discourse"-type vocabulary for things like "oratory" and "narrative". (For more detail on this, there are relevant recent posts, one from me, on the Metadata list.) 
Over time, I'm sure we'll find some of the vocabularies are more useful/used than others--and these can become OLAC recommended standard vocabularies. I think the real value of the new system will be that it is much more forgiving/flexible if we find we need to adapt our "type" categories in the future. Since Steven just posted about the idea that vocabularies be recommended practices, I'll say that I think that aspect of the proposal is also very helpful to working out a linguistic type vocabulary. One thing that at least I am convinced of in the discussion of "types" is that there is a counterexample to every generalization you can make about them. It may be the case that some counterexamples are minor enough that we can get away without a good classification for them. Or it might be the case that a counterexample is revealing a set of important omissions in the proposals. It's hard to tell without testing a lot of archives. A recommended, but not enforced, vocabulary would address this problem--as archivers encounter situations that aren't covered, they wouldn't be forced to "fit" their document into a category where it doesn't belong. This would not only promote the creation of needed new vocabulary items but also maintain the integrity of existing ones. Additionally, the idea of recommended vocabularies, plus a best practice standard, certainly is more in line with the general spirit of OLAC, and I think it would encourage more subcommunities to get involved and create vocabularies which they need. Jeff From baden at COMPULING.NET Wed Sep 25 04:31:35 2002 From: baden at COMPULING.NET (Baden Hughes) Date: Wed, 25 Sep 2002 14:31:35 +1000 Subject: A simpler format for OLAC vocabularies and schemes In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu> Message-ID: > So, what do you think? Do you agree with our proposals for > (i) a syntactic simplification in our XML representation, and The syntactic revision I personally agree with. Backwards and future compatibility is a significant factor and as such the new revisions I believe will make it easier to implement changes community wide and benefit archives who require special purpose extensions. > (ii) switching OLAC vocabularies from being centrally > validated standards to recommendations? We would welcome > your feedback. The proposal for recommendations rather than mandated standards seems to draw partially on both the W3C and IETF processes, whereby drafts or notes are submitted, reviewed, implemented and then reviewed with the view to standardisation if agreed as best practice. This process scales very well, and yet allows individuals or institutions the freedom to innovate whilst encouraging best practice once peer review of implementations has taken place. I think this is important to encourage innovation amongst participating archives who develop vocabularies to address their own needs first and then promote the benefits of these for wider community consideration. Baden From hdry at LINGUISTLIST.ORG Thu Sep 26 23:15:53 2002 From: hdry at LINGUISTLIST.ORG (Helen Dry) Date: Thu, 26 Sep 2002 19:15:53 -0400 Subject: A simpler format for OLAC vocabularies and schemes In-Reply-To: <200209162139.g8GLdsL27812@unagi.cis.upenn.edu> Message-ID: Hi, Steven (and everyone) Sorry to be so late responding to this proposal, but it's been a busy month. I am a little concerned about this proposal, perhaps because I don't understand exactly how the scheme system would work, so I thought I should make my comments and ask a few questions. 
Apologies if either or both are at a rather elementary level--I only seem to understand DC and XML for 10 minutes, right after I reread the websites. :-) It seems to me that there are two separable proposals here: (1) collapsing the formal mechanisms of refinement and scheme into the extension mechanism and (2) abandoning the attempt to reach general consensus on the descriptors that previously we were calling controlled vocabularies. The first may well be a welcome simplification, particularly administratively. (And I seem to have heard that it's the way the DC is going anyway.) The second seems worrisome to me for two primary reasons: (1) it seems counter to the overarching OLAC (and EMELD) goal of a unified--dare we say "standardized"?--mechanism for resource description and retrieval within the discipline; (2) on a practical level it may complicate--perhaps to a debilitating degree--the way that service providers implement search facilities. Of course, I'm thinking about LINGUIST here--we aren't an archive, so the potential benefits of being able to DESCRIBE resources via any scheme we might devise are not salient to me. What I'm worried about is how we're going to offer a search engine that makes use of all these variant descriptions. Particularly for something like linguistic data types--which is probably the main search field linguists will want to use--this seems almost like a return to the bad old days of the free text field, with the consequent loss of ability to identify and retrieve relevant resources. Now I imagine that there is some formal mechanism for relating schemes--I know you have a paragraph below about archives putting the schemes they use in their identifiers. But could you tell me exactly how this would work in practice? E.g., at the level of elements or terms? Would an archive that wants to use its own scheme have to provide a document showing how its categories relate to the categories in all the other schemes (e.g., that its "Seediq" was SIL's "Taroko.") Would the service provider have to construct a search engine that would first find and correlate all these documents, then search the multi-archive metadata for the resulting sets of terms? I'm sure it's possible--IF you could get everyone to provide scheme mappings--but it certainly seems unnecessarily complex. . . and, as I said, counter to the purpose of OLAC. I thought we were trying to settle on a unified way to describe linguistic resources, in order to offer the discipline the benefits of a level of standardization. Though this will come at the admitted expense of a certain amount of detail and precision, I feel confident that it will be accepted (accepted for what it is) if we persevere. After all, DC isn't perfect but people understand the utility of a restricted set of elements. It seems to me that, if the problem is that we may not come up with a proposal before December, we should either redouble our efforts and make the deadline or extend the deadline--not scrap the enterprise. Actually, with regard to linguistic data types, I feel confident we can come up with a reasonable proposal before the deadline. And I think it's important that we do so, since this is really one of the most important vocabularies--probably the most important for a large part of our audience, i.e. academic linguists. It's the main way that people, as opposed to machines, will want to search the archives. So, in sum, I agree with the arguments for using the extension mechanism and abandoning refinement and scheme. 
But I don't see the need to abandon the goal of reaching consensus on a single "OLAC-approved" set of linguistic data types, however that would be modeled in a world of "extensions" (not controlled vocabularies). Can we use extensions but not let in the world? BTW, under the proposal, will all the current refinements--e.g., "subject.language" now become schemes? But now I should stop and let someone knowledgable explain to me exactly how this scheme system will work. I'm all ears . . . . :-) Ready for enlightenment .... -Helen On 16 Sep 2002 at 17:39, Steven Bird wrote: The OLAC metadata format provides two mechanisms for community- specific resource description. First, special refinements (metadata elements and corresponding vocabularies) support compatible description across the community. For example, the subject.language element, and the OLAC-Language vocabulary, permit all archives to identify subject language in the same manner. Second, every OLAC element permits an optional scheme attribute for use by sub-communities of OLAC. For example, the scholars at Academia Sinica can use their own naming scheme for Formosan languages and still package it up using the OLAC metadata container. This combination of standard refinements and user-defined schemes seems to offer a reasonable balance between interoperability and extensibility. Over the past month, Gary and I have been reviewing the design of OLAC metadata and have concluded that these parallel mechanisms are unnecessary. We think that with a *single* extension mechanism, OLAC can provide even better interoperability and extensibility. Moreover, we think this can be done with less administrative and technical infrastructure than before, making it still easier for archives to participate in OLAC. A. THE PRESENT SITUATION We begin with a quick review of how the two existing mechanisms work in OLAC metadata. First, community-specific refinements are represented using Dublin Core qualifications represented in XML. Here is an example for subject language: A resource about the Sikaiana language: This refinement permits focussed searching and better precision/recall than the corresponding Dublin Core element: The Sikaiana Language The OLAC version is flexible in that the code attribute is optional and that free-text can be put in the element content. The second mechanism is for user-defined schemes. All OLAC elements permit a scheme attribute, naming some third-party format or vocabulary that one or more OLAC archives use. For instance, the language listed by Ethnologue as Taroko (TRV) is known as Seediq in Academia Sinica, and OLAC would permit either or both of the following elements to appear in a metadata record for this language: Seediq Such a resource would be discovered under either naming scheme, and Academia Sinica could provide end-user services that rewarded any archive which employed its scheme for Formosan language identification. B. PROBLEMS WITH THE PRESENT SITUATION There are four general problems with the present situation. 1. Finalizing standard refinements. Our track record at developing controlled vocabularies over the past year indicates that we are not going to be able to finalize all the vocabularies that the OLAC metadata standard specifies in time for launching version 1.0 after our December workshop. Even if some vocabularies are finalized by December, the discussion may be reopened any time a new kind of archive joins OLAC. 
However, each vocabulary revision must currently be released as a new version of the entire OLAC metadata set, an unacceptable bureaucratic obstacle. 2. The artificial distinction between refinements and schemes. It is not clear when a putative refinement is important enough to be adopted as an OLAC standard, versus a user-defined scheme. Some of the refinements we recognize at present aren't as germane to the overall enterprise as others (e.g. operating system vs subject language), and may not have enough support to be retained. Conversely, the community is sure to develop new, useful ontologies that we don't support at present, and we would need to change the OLAC metadata standard in order to accommodate them. Promoting a user-defined scheme to an OLAC standard would necessitate a change in the XML representation, generating unnecessary work for all archives that support the scheme. 3. Duplication of technical support. User-defined schemes are likely to involve controlled vocabularies, with the same needs as OLAC vocabularies with respect to validation, translation to human-readable form in service providers, and dumb-down to Dublin Core for OAI interoperability. At present, the necessary infrastructure must be created twice over, once for each of the two mechanisms. 4. Idiosyncracies of XML schema. XML schema is used to define the well-formedness of OLAC records, but it is unable to express co-occurrence constraints between attribute values. This means that we cannot have more than one vocabulary for an element, forcing us to build structure into element names and multiply the names (e.g. Format.markup, Format.cpu, Format.os, ...). It is unfortunate that such a fundamental aspect of the OLAC XML format depends on a shortcoming of a tool that we may not be using for very long. In sum, the current model will be difficult to manage over the long term. Administratively, it encourages us to seek premature closure on issues of content description that can never be closed. Technically, it forces us to release new versions of the metadata format with each vocabulary revision, and forces us to create software infrastructure to support a mishmash of four syntactic extensions of DC: C. A NEW APPROACH In response to the problems outlined above, we would like to propose a new approach. The basic idea is simple: express all refinements, vocabularies and schemes using a uniform DC extension mechanism, and treat them all as recommendations instead of centrally-validated standards. The extension mechanism requires two attributes, called "extension" and "code", as shown below: It would be syntactically valid to simply use an extension in metadata without defining it. However, for extensions that will be used across the community, there must also be a formal definition that enumerates the corresponding controlled vocabulary in such a way that data providers and service providers alike can harvest the vocabulary from its definitive source. Thus another aspect of the new approach is an XML schema for the formal definition of an XDC extension. In the description section of the OAI Identify response, a data provider would declare which formally defined extensions it employs in its metadata. Extensions that enjoyed broad community support would be identified as OLAC Recommendations (following the existing OLAC Process). All OLAC archives would be encouraged to adopt them, in the sense that OLAC service providers would permit end-users to perform focussed searches over these extensions. 
In this way, archives that cooperate with the rest of the community are rewarded.

Note that the approach isn't specific to language archives, so we're calling it extensible Dublin Core (XDC). An example of the syntax is available (an XML DTD, the equivalent XML schema, and an instance document):

http://www.language-archives.org/XDC/0.1/

D. BENEFITS

The new approach is technically simpler than the existing approach, and neatly solves the four problems we reported.

1. Finalizing standard refinements. The editors of OLAC vocabulary documents would be empowered to edit the vocabulary into the future, without concern for integration with new releases of the OLAC metadata format.

2. The artificial distinction between refinements and schemes. The syntactic distinction is gone, being replaced by a semantic one: is the vocabulary an OLAC Recommendation or not? Any archive or group of archives would be free to start using their own extensions without any formal registration. They could build a service to demonstrate the merit of their extension, thereby encouraging other archives to adopt it. Once broad support had been established, they could build a case for an OLAC Recommendation, leading to adoption across the community.

3. Duplication of technical support. With the single extension mechanism, we can provide uniform technical support for validation, translation and dumb-down.

4. Idiosyncrasies of XML schema. We no longer give XML schema such sway in determining our XML syntax. Other XML and database technologies will be used to test that an extension is used correctly.

In sum, the new approach is extensible, requiring no central administration of extensions and no coordination of vocabulary revisions with new releases of the metadata format. It also supports interoperability across the whole OLAC community (via OLAC Recommendations) as well as among OLAC sub-communities that want to create their own special-purpose extensions.

E. IMPLICATIONS

We are still working out the technical implications for OLAC central services (e.g. registration, Vida, ORE, etc.), and we will only be able to implement parts of this in time for the December meeting. As always, we would welcome donations of programmer time to help us. The short-term implication for OLAC archives is trivial, since only a simple syntactic change is required.

The most important implication of this change is that it reduces the pressure to reach final agreement on OLAC vocabularies by our December workshop. But this isn't an excuse for us to slow down on that front. On the contrary, it frees us up to find working solutions for the key vocabularies that define us as a community. These will always be imperfect compromises that we can agree to work with and revise as necessary, well into the future.

In sum, we hope we are not opening up a technical can of worms, but facilitating progress on the substantive issues, our common descriptive ontologies. Therefore, we encourage people to identify a particular extension that they would like to work on, and post their ideas and questions to this list (as Baden Hughes has just now done for sourcecode). You may also like to present your ideas at our workshop in December...

--

So, what do you think? Do you agree with our proposals for (i) a syntactic simplification in our XML representation, and (ii) switching OLAC vocabularies from being centrally validated standards to recommendations? We would welcome your feedback.
Steven Bird & Gary Simons

From hdry at LINGUISTLIST.ORG Thu Sep 26 23:36:44 2002
From: hdry at LINGUISTLIST.ORG (Helen Dry)
Date: Thu, 26 Sep 2002 19:36:44 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To:
Message-ID:

Hi, Gary (and everyone),

I've just sent a long posting to the list explaining some of my problems with Steven's & Gary's proposal, so all I want to do here is respond briefly. I completely agree with your point about the value of syntactic simplification. But I'm not sure about the second point--reducing all OLAC vocabularies to recommendations.

It's interesting where our opinions diverge--i.e., you see the benefits to the archive, which may already have a user-defined scheme, and I see the possible problems for the general service provider, which may not be able to handle multiple user-defined schemes in an efficient way. Perhaps OLAC can handle this problem by making STRONG recommendations . . . but in that case, I don't see the real difference between recommendations and a centrally validated standard . . . except for the fact that OLAC wouldn't have to re-publish all the metadata whenever a recommendation changed. I suppose this would be an administrative advantage--but enough of one to lose the potential benefits of standardization??? I'm waiting to be convinced....

-Helen

On 24 Sep 2002 at 10:07, Gary Holton wrote:

On Mon, 16 Sep 2002 17:39:54 EDT, Steven Bird wrote:

>--
>
>So, what do you think? Do you agree with our proposals for
>(i) a syntactic simplification in our XML representation, and
>(ii) switching OLAC vocabularies from being centrally validated
>standards to recommendations? We would welcome your feedback.
>

Dear Steven & Gary,

I haven't had much time to digest your proposal, but my initial reaction is very positive. Regarding (i), it is clear that a syntactic simplification is needed. I for one have never been able to keep straight refinements vs. schemes, and I don't think I'm alone here. And as you point out in (ii), the real issue should be not whether a particular refinement (and associated vocabulary) has been officially adopted (mandated?), but rather whether such a refinement is useful to the community. We can debate ontologies, but it is more difficult to debate usefulness without actually implementing a refinement. Your proposal would permit refinements ("extensions") to fit the needs of the community, so that useful solutions could evolve.

I have often approached the metadata issue by trying to imagine what types of refinements and vocabularies would be useful to the end user. The difficulty is that we don't know enough about how the user will be searching, what they will be searching for, and what types of search facilities they will have. The best we can do at this point is make an educated guess and then watch closely to see how the refinements and vocabularies are actually used. That said, I think we have some very good guesses already and will certainly be able to recommend best practices by December. However, if we lock in the vocabularies then most archives will continue to have to support both an OLAC schema and a user-defined schema (as you point out). This would essentially remove the data provider from the loop, in that user-defined schemas would be viewed as idiosyncratic and non-standard. Allowing user-defined "extensions" would encourage innovation on the part of both data and service providers--innovation mediated by the end user.

Any reactions from others?
Gary Holton

From Gary_Simons at SIL.ORG Fri Sep 27 00:24:30 2002
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Thu, 26 Sep 2002 19:24:30 -0500
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID:

Helen,

You hit the nail on the head when you observe: "in that case, I don't see the real difference between recommendations and a centrally validated standard". It was that same observation, but coming from the point of view of our status quo, that has been a key part of the motivation as Steven and I have been thinking about what our version 1.0 standard should look like.

In version 0.4 we have a centrally validated and mandated standard, but it has built-in optionality. For instance, it is our standard to use SIL and Linguist codes to identify languages precisely, but data providers also have the option of just providing free text. Thus the standard is currently not requiring language codes but only recommending them as best practice, and an examination of the harvested records from our 20 or so participating data providers reveals that many sites are not now using codes.

Our proposal to take the controlled vocabularies out of the standard and to treat them as best practice recommendations thus does not really change the current reality. In fact, it probably gives a better reflection of that reality. One key advantage from the point of view of managing the infrastructure is that it will not be necessary to change the standard when controlled vocabularies are changed or added. The metadata standard would just specify the structure of the container record and the mechanism for defining metadata extensions, and would be very static. Each controlled vocabulary would be managed separately, in an independent document and in a formal extension definition that would supply downloadable code sets so that extension data can still be centrally validated.

When the community reaches a consensus that a particular vocabulary should be used when applicable, then it would become a community Recommendation and our default harvester would support it. Service providers would exploit it (as Linguist is now doing with searching by language) and that would show data providers who are not yet using the vocabulary the benefits of using it. We could even have a "Recommended practice report card" that would show which recommended extensions an archive is using and which it is not.

Thus Steven and I are assuming that the end result of this change would not weaken compliance with standardized vocabularies (which is already optional), but that it would make it much easier to manage changes to vocabularies and to experiment with specialized vocabularies.

I hope that helps to clarify where we are coming from.

-Gary Simons
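The "formal extension definition" with "downloadable code sets" that Gary describes might look something like the sketch below. This is only a guess at the shape of such a document; the element names and structure are assumptions, not the actual XDC extension schema:

   <!-- hypothetical extension definition enumerating a controlled
        vocabulary so that data providers and service providers can
        download the licensed codes and validate records against them -->
   <extension name="OLAC-Language">
     <code value="x-sil-SKY">Sikaiana</code>
     <!-- ... one entry per code in the vocabulary ... -->
   </extension>

On this model a central validator, or any harvester, could check every code attribute in harvested records against the downloaded list, which is how extension data could still be centrally validated without freezing each vocabulary into the metadata standard itself.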
From hdry at LINGUISTLIST.ORG Fri Sep 27 16:46:45 2002
From: hdry at LINGUISTLIST.ORG (Helen Aristar Dry)
Date: Fri, 27 Sep 2002 12:46:45 -0400
Subject: A simpler format for OLAC vocabularies and schemes
In-Reply-To:
Message-ID:

Hi, Gary,

Yes, I take your point that we can't force compliance; and, in general, I'd be all for letting standards evolve from usage. But actually, from the point of view of the LINGUIST service provider, the languages example isn't a heartening one.
What our programmer had to do to search harvested OLAC metadata by subject language is write a special program that translates any text entry in the subject language field into the SIL code. This is possible to do with languages only because we have the Ethnologue name and alternate-name tables on the site, and therefore we have a list of almost all the language names that any site might be using. It's still a lot of work, and we're no doubt missing or misclassifying the subject languages of a lot of records. Nevertheless, we do have a search engine that is using Ethnologue codes to identify resources by subject.language, thereby demonstrating the utility of this recommendation.

But what are we going to do for linguistic data type and all the other erstwhile controlled vocabularies?? There's no "alternate name" reference for extensions (at least not as far as I know) that we could use to write a translation program . . . even if it were feasible to translate every relevant value in every metadata record. And it makes no sense to set up search facilities that use the recommended vocabulary if there's no data classified by it--getting a lot of "not found" messages will discourage users from using the recommended vocabulary, not encourage it. So our search engine is not going to be any help in promulgating these recommendations. Sigh.

I realize that mandating a controlled vocabulary wouldn't ensure that archives used it. Perhaps it would give them a little more impetus, however. And it would certainly be nice if each archive would "translate" its user-defined metadata into the recommended OLAC vocabulary, rather than leaving the service provider to figure out how to do it for multiple archives, each with its own idiosyncratic and undocumented set of extensions. I'm still hoping that you and Steven will come up with some bright ideas about how to help/encourage/convince archives to do this . . .

Sorry to be negative. You know I think OLAC is the best thing since sliced bread. . . . I'm just having some trouble figuring out how we're going to cope with the new-fangled slices....

All the best,
-Helen
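The translation step Helen's programmer implemented can be pictured as rewriting a free-text element into a coded one. The sketch below follows the subject.language element style described earlier in the thread; the code value for Sikaiana is an assumption:

   <!-- harvested record: free text only -->
   <subject.language>Sikaiana</subject.language>

   <!-- after lookup in the Ethnologue name and alternate-name
        tables, the matching SIL code is attached -->
   <subject.language code="x-sil-SKY">Sikaiana</subject.language>

As she notes, no comparable alternate-name tables exist for linguistic data type or the other vocabularies, so the same lookup trick does not generalize.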