From sb at CS.MU.OZ.AU Tue Oct 1 07:33:05 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Tue, 1 Oct 2002 03:33:05 EDT
Subject: Call for Participation: OLAC Workshop
In-Reply-To: Your mail dated Thursday 8 August, 2002.
Message-ID:

Folks - the workshop is fast approaching; just over two months to go
now. If you haven't already done so, please communicate your intention
to participate to Gary and me, by replying to this email.

We'll be circulating more details about the workshop soon. For now
please take a look at the list of preparatory tasks from the original
call, which I'm appending below.

Thanks,
Steven Bird

>
> WORKSHOP ON OPEN LANGUAGE ARCHIVES
> Institute for Research in Cognitive Science (IRCS)
> University of Pennsylvania, Philadelphia
> December 10-12, 2002
>
> Sponsored by the National Science Foundation project:
> International Standards in Language Engineering (ISLE)
>
>
> OLAC, the Open Language Archives Community, was founded at the
> Workshop on Web-Based Language Documentation and Description, in
> December 2000. During 2001, the OLAC development phase, the core
> infrastructure for OLAC was built and alpha testers implemented data
> providers. During 2002, the pilot phase, we froze the standards to
> encourage wider adoption and experience with the metadata and the
> protocol. At the close of 2002 we want to draw together all this
> experience, make final revisions, and launch the operational phase.
> With this launch, the OLAC standards will be promoted from "candidate"
> to "adopted", and version 1.0 of the OLAC XML schemas will be released.
>
>
> WORKSHOP GOALS
>
> The workshop will be tightly focussed on the following goals:
>
> 1. Standards: To revise the three proposed standards, the OLAC
>    Metadata Set, the OLAC Process document and the OLAC Protocol.
>
> 2. Vocabularies: To finalize the controlled vocabularies: linguistic
>    type, software functionality, rights, format, encoding, ...
>
> 3. Review: To give feedback to each participating archive on its use
>    of metadata, to review the services on the OLAC and LINGUIST sites.
>
> 4. Proposals: To hear new proposals for working groups, encoding
>    schemes, implementation notes and best practice recommendations,
>    and position papers on work that still needs to be done.
>
> In support of these goals, the workshop will consist of:
> * group discussions, both plenary and in parallel working groups;
> * review/editing of documents, both in working groups and in private;
> * plus a limited number of presentations (cf goal 4).
>
> NB. No time will be allocated for project reports in the formal program.
>
>
> PARTICIPATION
>
> The workshop is open to advisory board members and representatives of
> participating archives, consistent with our core value of "Empowering
> the Players" [http://www.language-archives.org/OLAC/process.html].
>
> *** Please communicate your intention to participate by October 1.
>
> NB. If you have been thinking about becoming an OLAC data provider, now
> would be a good time to act. Any archive that becomes a data provider
> by October 1 will also be invited to participate in this foundation
> setting workshop. For more information on becoming a data provider,
> please see http://www.language-archives.org/docs/implement.html
>
>
> SPONSORSHIP
>
> The workshop is being sponsored by the NSF ISLE project "International
> Standards in Language Engineering". We have funding for accommodation
> at the University Sheraton, a short walk from IRCS. No registration
> fee will be charged. Some travel support may also be available.
>
>
> PREPARATORY TASKS
>
> In order to ensure that the workshop achieves its goals, participants
> will be expected to help create, review and edit draft documents ahead
> of the meeting. We would like each person to contribute 1-2 days
> each month to this effort from September onwards. The preparatory tasks
> correspond to our workshop goals, and are as follows:
>
> 1. Standards: review all the standards documents and suggest revisions
>
> 2. Vocabularies: review some of the controlled vocabularies and
>    suggest revisions
>
> 3. Review: choose three participating archives besides your own and
>    suggest improvements to their use of metadata; review the
>    www.language-archives.org site and the www.linguistlist.org/olac/
>    service and suggest improvements.
>
> 4. Proposals: draft an encoding scheme, an implementation note, a
>    best practice recommendation, or a proposal for anything else that
>    needs to be done, and present it to the group.
>
> The success of the workshop will depend on active participation in
> these tasks. Comments circulated in advance will have the most impact
> on our work. To facilitate the process we will use this list,
> OLAC-Implementers, except where formal working groups have already
> been established with their own lists. Note that OLAC-Implementers is
> an open, unmoderated list, archived on the LINGUIST site at:
> http://lists.linguistlist.org/archives/olac-implementers.html
>
> More information will be circulated in September. In the meantime,
> please feel free to get started on any of the above tasks...
>
> Steven Bird & Gary Simons
>
>

From sb at CS.MU.OZ.AU Thu Oct 3 01:38:48 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Wed, 2 Oct 2002 21:38:48 EDT
Subject: Some comments on the LINGUIST service provider
Message-ID:

One of the workshop preparatory tasks is:

> 3. Review: choose three participating archives besides your own and
>    suggest improvements to their use of metadata; review the
>    www.language-archives.org site and the www.linguistlist.org/olac/
>    service and suggest improvements.

I have three low-level comments on the LINGUIST service provider. I
hope this feedback will make the service even better than it already
is...

a) The first page you come to is a long document with a search form
some way down.
I'd favor a very simple page (cf www.google.com) consisting of a
search box, a link to the advanced search, and a link to "more about
OLAC" which has all the original text.

b) Users wanting "more powerful search" are directed to the "OLAC Query
page". (Weren't we just on an OLAC query page?) Arriving on this new
page, we see that it is called "OLAC Query Form: Simple Search". This
is confusing, since we've just come from a simple search page expecting
the more powerful search page, only to find that this is still only
simple search. There's no pointer back to the really simple search.
I'd prefer this to be called "Advanced Search" (both on the title and
the incoming link), with a backpointer to the simple search.

c) This second page points to yet another page, called Advanced Search.
However, this generates an error: "ODBC Error Code = S1000 (General
error) [TCX][MyODBC]Table 'OLAC.alltypes' doesn't exist". I expect
this really advanced search permits search on all fields. I'm not
convinced we need three levels of search. Could the second and third
levels be collapsed into a single level, containing all the search
fields?

Does anyone else have comments on this service?

-Steven

--
Steven Bird Email: Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania

From baden at COMPULING.NET Thu Oct 3 10:24:37 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 20:24:37 +1000
Subject: Some comments on the LINGUIST service provider
In-Reply-To: <200210030138.g931cmM07394@unagi.cis.upenn.edu>
Message-ID:

From dealing with some new end users who have been introduced to OLAC
via the Linguist interface, I've got a couple of related comments.

Users would like to have a simple search - by title, author,
description and subject language. This would mean author would be
added to the existing Quick Search.

There is a difference between the number of archives actively searched
on the LL site and those registered at the OLAC site. I would have
assumed automated harvesting of the new archives as they are registered
at either location?

An ultra-low-level comment: when you click on the link at the bottom of
the LinguistList OLAC page:

"If you would like to help with the OLAC enterprise, please let us
know! Thank you in advance for your help!"

An email message is launched, but there's no email address to send
things to (i.e. the mailto: link is malformed).

Baden

From baden at COMPULING.NET Thu Oct 3 10:28:59 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 20:28:59 +1000
Subject: OLAC resources
Message-ID:

FWIW, the format.cpu, format.os and format.sourcecode schemas are
available at http://www.compuling.net/projects/olac/ along with some
other OLAC resources under development.

Baden

From baden at COMPULING.NET Thu Oct 3 12:14:29 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Thu, 3 Oct 2002 22:14:29 +1000
Subject: experimental schema: format.sourcestatus
In-Reply-To: <200209162213.g8GMDGL02117@unagi.cis.upenn.edu>
Message-ID:

Earlier I wrote to this list describing a problem I had found with the
format.* schemas, in that they did not necessarily describe a certain
aspect of a software resource.

I believe retaining the format.cpu, format.os and format.sourcecode
vocabularies is beneficial. However, I would like to propose a new
addition to these, namely a schema for "format.sourcestatus", which
would be an optional controlled vocabulary, considered experimental
only at this stage.

The purpose of format.sourcestatus is to address two needs identified
by end users as critical to being able to evaluate a piece of software
and determine its degree of utility to their own circumstances,
eloquently expressed by Steven Bird as:

> the end-user requirement here is to be able to answer the
> question: "Can I run this software?"

and

> the end-user requirement here is to be able to answer the
> question: "How much effort will be required to get this running?"

In addressing these questions, format.sourcestatus is a controlled
vocabulary that provides a range of descriptive options which assist
the user in identifying whether or not they can use the software
resource in question, and what additional requirements there will be
to make it work.

format.sourcestatus will contain enumeration values like the following:

   Pre-Compiled Binary
   Requires Compilation
   Requires Make
   Wrapped Installation
   Script

There is a rudimentary draft of this available at:
http://www.compuling.net/projects/olac/031002-draft-olac-format.sourcestatus.xsd

It also occurs to me that format.sourcecode may not be the best name
for the controlled vocabulary. In essence, the identification
performed by this schema is of the language in which sourcecode is
written.

Any comments?

Baden

From sb at CS.MU.OZ.AU Thu Oct 3 22:38:05 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 3 Oct 2002 18:38:05 EDT
Subject: Some comments on the LINGUIST service provider
In-Reply-To: Your mail dated Sunday 3 November, 2002.
Message-ID:

Helen Aristar Dry wrote:
> But he suggests having a search blank, plus a full search. I guess I
> just need to think about whether there's some way to do both what he
> suggests and what you suggest.

Would this work: a simple search page with a single keyword search
field, and an advanced search page in which the most salient fields
(e.g. Baden's list) appeared at the top? Further fields could be
separated off from the main ones and/or be given in smaller type.

Steven Bird

From hdry at LINGUISTLIST.ORG Thu Oct 3 23:12:17 2002
From: hdry at LINGUISTLIST.ORG (Helen Aristar Dry)
Date: Thu, 3 Oct 2002 19:12:17 -0400
Subject: Some comments on the LINGUIST service provider
In-Reply-To: <200210032238.g93Mc6M09019@unagi.cis.upenn.edu>
Message-ID:

Good idea, Steven. Thanks.

-Helen

Date sent:      Thu, 3 Oct 2002 18:38:05 EDT
Send reply to:  Steven Bird
From:           Steven Bird
Organization:   University of Melbourne
Subject:        Re: Some comments on the LINGUIST service provider
To:             OLAC-IMPLEMENTERS at LISTSERV.LINGUISTLIST.ORG

> Helen Aristar Dry wrote:
> > But he suggests having a search blank, plus a full search. I guess I
> > just need to think about whether there's some way to do both what he
> > suggests and what you suggest.
>
> Would this work: a simple search page with a single keyword search field,
> and an advanced search page in which the most salient fields (e.g. Baden's
> list) appeared at the top? Further fields could be separated off from the
> main ones and/or be given in smaller type.
>
> Steven Bird

From baden at COMPULING.NET Fri Oct 4 13:55:00 2002
From: baden at COMPULING.NET (Baden Hughes)
Date: Fri, 4 Oct 2002 23:55:00 +1000
Subject: experimental schema:type.functionality
Message-ID:

The purpose of type.functionality is to describe the functionality of a
software resource.

There is a rudimentary draft of this available at:
http://www.compuling.net/projects/olac/041002-draft-olac-type.functionality.xsd

This is based on the categorization from the HLT Survey at
http://cslu.cse.ogi.edu/HLTsurvey/

Baden

From sb at CS.MU.OZ.AU Mon Oct 7 02:34:22 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Sun, 6 Oct 2002 22:34:22 EDT
Subject: experimental schema: format.sourcestatus
In-Reply-To: Your mail dated Thursday 3 October, 2002.
Message-ID:

Last week Baden Hughes presented a new encoding scheme called source
status. Here are some initial comments:

> Pre-Compiled Binary

or just "binary"?

> Requires Compilation
> Requires Make
> Wrapped Installation

These three are closely related - a build is required, and the
difference is in how much work the person has to do.

> Script

So a simple starting point here would be to have a three-way
distinction between binary, interpreted and compiled.
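
The three-way split suggested here can be sketched as a small lookup from the draft format.sourcestatus values onto the coarser categories. This is only an illustration: the class and function names are hypothetical, and the mapping of "Script" to "interpreted" is an assumption read off the comments above, not something taken from any posted schema.

```python
from enum import Enum


class Distribution(Enum):
    """Coarse three-way distinction: binary, interpreted, compiled."""
    BINARY = "binary"
    INTERPRETED = "interpreted"
    COMPILED = "compiled"


# Draft format.sourcestatus values mapped onto the coarser categories.
# The three "build is required" values all collapse to COMPILED.
DRAFT_TO_DISTRIBUTION = {
    "Pre-Compiled Binary": Distribution.BINARY,
    "Requires Compilation": Distribution.COMPILED,
    "Requires Make": Distribution.COMPILED,
    "Wrapped Installation": Distribution.COMPILED,
    "Script": Distribution.INTERPRETED,  # assumed reading of "Script"
}


def needs_build(draft_value: str) -> bool:
    """Return True when the user must build the software before running it."""
    return DRAFT_TO_DISTRIBUTION[draft_value] is Distribution.COMPILED


if __name__ == "__main__":
    for value, category in DRAFT_TO_DISTRIBUTION.items():
        print(f"{value:22} -> {category.value:12} build needed: {needs_build(value)}")
```

A lookup like this loses the "how much work" gradation between the three build-required values, which is the information the original five-value list was trying to keep.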

[Aside: In all three cases, other packages may need to be downloaded,
built and installed before the software can be run, and these will need
to be documented using the relation.requires element/refinement.
Presumably we won't bother specifying that a C compiler is required for
a resource that is specified as being in the C language, unless a
particular compiler/version is required.]

Notice that the distinction between interpreted and compiled is largely
predictable from the source language, and that the source code might
not actually be provided. Therefore, we want to focus not on the
source code, but the nature of the distribution (format.distribution?).
Obviously, this now applies to data as well as software, since data can
come in binary or source forms, with or without wrapping.

The distribution methods include archives (tar, zip, rpm) which may be
compressed, and may be self-extracting or require other software. The
self-extracting kind might actually manage the download and
registration process, as in the case of the CSLU toolkit.

To some extent, the distribution method is predictable from the MIME
type of the file, which weakens the case for special treatment of
distribution types.

An orthogonal issue is size: can I download this over a modem line?

Anyway, to move things forward here, we may need to do some more study
of end-user needs.
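
The two points above (what the file type already reveals about the distribution, and whether a resource is practical to fetch over a modem) can be sketched roughly as follows. This is an illustration only: the 56 kbit/s line speed, the file name and the function names are assumptions, not part of any OLAC proposal.

```python
import mimetypes

# Assumed nominal modem speed; real throughput would be lower.
MODEM_BITS_PER_SECOND = 56_000


def distribution_hint(filename: str):
    """Guess (MIME type, compression encoding) from the file name alone."""
    return mimetypes.guess_type(filename)


def modem_minutes(size_bytes: int, bits_per_second: int = MODEM_BITS_PER_SECOND) -> float:
    """Estimate download time in minutes at the given line speed."""
    return size_bytes * 8 / bits_per_second / 60


if __name__ == "__main__":
    name = "toolkit.tar.gz"  # hypothetical distribution file
    mime_type, encoding = distribution_hint(name)
    print(f"{name}: type={mime_type}, encoding={encoding}")
    print(f"7 MB over a modem: about {modem_minutes(7_000_000):.0f} minutes")
```

The guess covers only the archive and compression layer; whether the package is self-extracting or needs other software installed first is not visible from the MIME type.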

-Steven

--
Steven Bird Email: Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania

From sb at CS.MU.OZ.AU Fri Oct 18 02:11:48 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 17 Oct 2002 22:11:48 EDT
Subject: Local arrangements in Philadelphia
Message-ID:

Folks,

I have now set up a website for the workshop at:
http://www.language-archives.org/events/olac02/

The most important information it contains now is the list of
confirmed participants and the arrangements for booking your hotel
room. Note that we are paying for hotel rooms for the confirmed
participants (except local participants). Please call the hotel to
make your booking, using one of the numbers on the website. Please
contact Laurel Sweeney at Penn if you encounter any problems with the
booking process.

Information about the workshop program will be posted next week.

Others who wish to attend need to contact me as soon as possible
please.

Thanks,
-Steven

--
Steven Bird Email: Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania

From sb at CS.MU.OZ.AU Wed Oct 23 10:26:21 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Wed, 23 Oct 2002 06:26:21 EDT
Subject: workshop program
Message-ID:

Folks,

I'm sorry that the workshop program is long overdue. There is a lot to
cover, and Gary and I would like to solicit your input on priorities,
and on the contributions of each participant.

We think the top level goals are:

1. to effect the transition to the operational phase of OLAC
2. to set the agenda for the coming year
3. to foster ongoing collaboration amongst the participants in
   the pursuit of the above

In support of these goals, the primary workshop activities need to be:

1. presenting and reviewing all the standards, understanding the
   implementation issues, and releasing version 1.0
2. finalizing, testing and documenting key recommendations - the
   metadata vocabularies
3. evaluating the community infrastructure - website, services,
   documentation

Here then is a comprehensive overview of the OLAC infrastructure, both
existing and planned, along with various suggestions about what we
could accomplish before/during the workshop, and who could possibly
take the lead in doing or delegating the work. There is a lot here,
but many items can be dispensed with quickly (e.g. a 10 minute report),
while some big things that are beyond the scope of our workshop can be
put on the agenda of a working group for 2003. I hope that the work
will be shared around, so that everyone has significant activities to
do in the remaining six weeks.

So please suggest priorities, identify any omissions, and volunteer to
work on something. I'll convert this into a provisional program by the
start of next week.

Thanks,
-Steven

----

Annotations:
  feedback: feedback requested before workshop
  overview: a short presentation (10 minutes)
  presentation: full presentation (20-30 minutes)
  wg: working group(s) will process this

1. STANDARDS (Tuesday)

All of these need to be presented on day 1 (even if briefly) to make
sure there is enough time for feedback and consensus building if any
issues do arise.

a) OLAC-Process [feedback, overview] - Gary Simons?
   * present and discuss at start of workshop because it defines
     how we will operate even during the workshop

b) OLAC-PMH [overview, wg] - Steven Bird?
   * the primary issue will be the transition from OAI 1.1 to 2.0
   * those who implement data providers to discuss

c) OLAC Metadata Format [feedback, presentation, wg] - Steven Bird?
   * new work on representing OLAC metadata in XML
   * more information will be circulated this week
   * those who implement data providers to discuss

d) OLAC Metadata Extension Mechanism [presentation, wg] - Steven Bird?
   * how to express a vocabulary in a harvestable schema fragment
   * those who implement 3rd party extensions to discuss

2. RECOMMENDATIONS (Tuesday/Wednesday)

These are our vocabularies, along with any new proposals for
recommendations (e.g. best practices for digitizing audio recordings).

a) OLAC-Language [overview] - Gary Simons?, Anthony Aristar?

b) OLAC-Linguistic-Type [feedback, overview, wg?] - Heidi Johnson?,
   Helen Aristar Dry?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository
   * the working group meeting may not be necessary

c) OLAC-Linguistic-Fields [feedback, overview] - Helen Aristar Dry?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository

d) OLAC-Role: [feedback, overview, wg] - Heidi Johnson?
   * a vocabulary document to be circulated before the workshop
   * participants to apply the terms to their repository
   * still need to consider roles in the creation of language
     technologies and corpus publications

e) OLAC-Rights: [feedback, overview, wg] - Heidi Johnson?, Steven Bird?

Other vocabularies to consider: OLAC-Encoding, OLAC-Format,
OLAC-Functionality.

Time to be given to testing the vocabularies on existing repositories.

3. ARCHIVES AND SERVICES (Wednesday)

a) review metadata quality for existing archives [feedback]

b) OLAC website [feedback]

c) Registration [overview] - Gary Simons?

d) Vida/ORE/ORyX/OLACA/Viser [overview]
   * need to identify developers to help in 2003

e) LINGUIST [feedback, overview] - Helen Aristar Dry?, Anthony Aristar?

4. SUB-COMMUNITY EXTENSIONS (Wednesday)

a) Language technology [feedback, overview, wg] - Baden Hughes?
   * vocabulary documents to be circulated before the workshop
   * work on vocabularies for OS, CPU, Sourcecode, Distribution

b) Language documentation [overview, wg] - Heidi Johnson?
   * IMDI/OLAC mapping?
   * possible common vocabularies across IMDI and OLAC

5. IMPLEMENTATION NOTES (Wednesday/Thursday)

Useful tools that people have developed:
 - exporting MS Access to ORyX files for Net-DC - Andrew Cole?
 - Net-DC experience - Khalid Choukri?
 - AILLA database model - Erik Grostic?

6. AGENDA FOR 2003 (Thursday)

a) more best practices
   * there are many areas where we need best practice recommendations
     [http://www.ldc.upenn.edu/sb/home/publications.html#0204020]
   * who wants to pick a need and start working on a recommendation?

b) more data providers
   * outreach, special needs, help with data providers
   * many subcommunities are creating resources
   * who wants to commit to helping them hook up with OLAC?
     + linguistics - accessible OLAC introduction - Jeff Good?
     + language technology
     + national archives
     + text archives
     + museum archives (e.g. 19C fieldwork materials)
     + antiquity (e.g. classical and ancient Near East text collections)
     + others?

c) more service providers
   * regional services (e.g. Asia)
   * services tailored for research needs (e.g. typology)

d) proposals for other work that needs to be done

--end--

From sb at CS.MU.OZ.AU Thu Oct 31 06:41:07 2002
From: sb at CS.MU.OZ.AU (Steven Bird)
Date: Thu, 31 Oct 2002 01:41:07 EST
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID:

About six weeks ago, Gary Simons and I presented a schematic outline
for a new representation for OLAC metadata. We described a single
extension mechanism that would provide better interoperability and
extensibility, with less administrative and technical infrastructure
than before, with the goal of making it still easier for archives to
participate in OLAC.

About the same time we discovered very recent DCMI work on the XML
representation of DC and DC qualifiers:

  Guidelines for implementing Dublin Core in XML
  http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/

  Recommendations for XML Schema for Qualified Dublin Core
  http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/20021007/

These documents finally provide the DC XML framework that we had hoped
to find way back in January 2001, when we first started working on an
XML representation of our own Dublin Core qualifiers.

In the intervening six weeks we have figured out a new format for OLAC
metadata which implements our simplified extension mechanism, while
simultaneously re-using the new schemas from the DCMI.


REVIEW

To recap briefly, here are three examples showing OLAC 0.4 metadata,
the version in current use:

  Dschang
  Seediq
  Sapir, Ned

The examples illustrate several points:

(a) Element refinement: subject.language, editor
    (i.e. two different methods)
(b) OLAC encoding scheme: code="xxx"
(c) Free text element content, the escape hatch when OLAC codes don't fit
(d) A third party encoding scheme: scheme="xxx"

Here's the same information represented according to last month's
proposal for a simplified extension mechanism:

  Dschang
  Sapir, Ned

According to our proposal, this extension attribute would be used to
express all refinements, vocabularies and schemes, whether originating
from OLAC, an OLAC subcommunity, or an individual archive. These
extensions wouldn't be centrally controlled, so individual archives and
groups of archives could develop their own extensions without any
community-wide approval process, and later demonstrate useful services
based on their extension in order to promote it to the community at
large.


REVISED REPRESENTATION

In the revised representation we are now proposing, the "extension"
attribute is renamed "xsi:type", and its value is given a namespace
prefix.
For example, the above three elements would be rewritten as follows:

  Dschang
  Sapir, Ned

This little change brings us into line with DCMI. No longer do we have
to define DC and DC qualifiers ourselves, we can now simply import the
DCMI Schemas directly. This means that OLAC metadata is not simply a
semantic extension of DC metadata as in the past, but the OLAC metadata
*format* is a *syntactic* extension of the DC metadata format.


THE FILES

The schemas are posted at: http://www.language-archives.org/OLAC/1.0b1/

The contents of the directory are as follows:

1. Example metadata record
   * olac.xml

2. Top level OLAC schema
   * olac.xsd

3. OLAC vocabularies (subject to approval at the December workshop)
   * olac-date.xsd
   * olac-language.xsd
   * olac-linguistic-field.xsd
   * olac-linguistic-type.xsd
   * olac-role.xsd

4. Hypothetical third-party extensions (to be hosted off-site)

   a) Academia Sinica Formosan language vocabulary
      * third-party/as-formosan.xml
      * third-party/as-formosan.xsd

   b) LT-World Human Language Technology vocabulary
      * third-party/ltworld-hlt-field.xml
      * third-party/ltworld-hlt-field.xsd

   c) Individual archive's own redefined OLAC vocabularies
      * third-party/myolac.xml
      * third-party/myolac.xsd

   d) Networking Data Centers' vocabulary (LDC/ELRA)
      * third-party/netdc.xml
      * third-party/netdc.xsd

   e) Software vocabularies
      * third-party/software.xml
      * third-party/software-cpu.xsd
      * third-party/software-os.xsd
      * third-party/software-sourcecode.xsd
      * third-party/software.xsd

   f) An example mixing three independent extensions
      * third-party/combined.xml


TECHNICAL DISCUSSION

(a) About xsi:type

The xsi:type attribute is defined in the XML Schema standard. It is a
directive to a schema validator, telling it to override the definition
of the XML element with the named type definition. It uses the
namespace declaration to find the schema fragment that defines the
overriding type.
Thus, the attribute xsi:type="olac:language" says: "take the DC
definition of subject, add an optional "code" attribute, and restrict
the code values to the range specified in the schema for
olac:language".

(b) Harvesting

When harvesting these records, OLAC service providers will store OLAC
and third-party metadata elements in the same way, using columns for
the extension name (i.e. the value of the xsi:type attribute), for the
code, and for the element content. In this way, coded values and
element content will be searchable for both OLAC and third-party
vocabularies alike. However, only OLAC vocabularies would have special
services associated with them (e.g. the language codes service built
into the LINGUIST service provider). The proposer of a new extension
could set up their own service provider to demonstrate the value of
their vocabulary in resource discovery and promote it to the whole OLAC
community.

(c) Dumb-down

Dumb-down from a third-party extension to OLAC, and dumb-down from OLAC
to DC, are straightforward to implement in this model. Full details
will be circulated in a later message.

(d) Application profiles

An "application profile" is a hybrid metadata record that combines
elements and attributes that come from multiple authorities [1,2].
Under the newly proposed approach, we can conceive of OLAC metadata as
an application profile for the language resources community. When a
third party wants to extend the OLAC application profile, they are
actually creating a new application profile that combines DC and OLAC
metadata elements and attributes, along with their own.

[1] http://www.ariadne.ac.uk/issue25/app-profiles/
[2] http://dublincore.org/documents/library-application-profile/

(e) Copying the DCMI use of XML schemas

The decision to copy the DCMI's use of XML Schemas has two unfortunate
and unavoidable consequences. First, the XML representation of DC and
OLAC metadata is tied to XML Schema validation.
If the validation technology is ever changed, then the metadata format
will need to be changed. Second, the xsi:type declarations are not
constrained as to which DC element they appear on. If a metadata
record used the role vocabulary on an inappropriate element such as
title, then the schema validation would not report this error. These
are problems with the implementation decisions made by the
DC-Architecture Working Group, problems that we inherit. We feel that
it is more important to conform to the DCMI and work with them to
address these issues, rather than continuing to work in isolation.

(f) Preserving a simple migration path

The new proposal maintains the simple migration path that is currently
permitted with OLAC 0.4. This is an important feature for new archives
coming in to OLAC. The following sequence illustrates the migration
path:

Step 1: archive maps their topic descriptor to the DC subject element:

  prosody

Step 2: archive uses the OLAC extension as a refinement, to state that
the element content pertains to a linguistic field:

  prosody

Step 3a: archive identifies the nearest OLAC code but retains their own
data as a comment, to provide additional information:

  prosody

OR

Step 3b: archive persuades community to accept a new vocabulary item:

Note that step 3a illustrates an escape hatch for archives that have a
problem mapping their descriptors to OLAC vocabulary items. Note also
that this approach represents a minor deviation from the DCMI approach,
which puts coded values in the element content, leaving no room for
comments.


CONCLUSION

The revised proposal differs minimally from the previous proposal: the
"extension" attribute is renamed "xsi:type". We believe this proposal
represents a significant improvement on the current OLAC 0.4 format in
the areas of simplicity, interoperability and extensibility.
Furthermore, it puts us squarely in the DC community: OLAC won't have
to reimplement each new DC Qualifier that the DCMI adopts; OLAC can
benefit from any software that works on DC metadata; and OLAC
vocabularies can be easily adopted outside the OLAC community.

With your approval, we will document this new format and bring it up at
the December workshop as the proposal for OLAC version 1.0. Once
adopted, each OLAC archive would be required to support it in order to
participate in OLAC.

Please send any comments to the list.

Steven Bird & Gary Simons

From Gary_Simons at SIL.ORG Thu Oct 31 13:31:44 2002
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Thu, 31 Oct 2002 07:31:44 -0600
Subject: A simpler format for OLAC vocabularies and schemes
Message-ID:

On rereading our posting this morning, I realized that there is one
major feature of the new approach that we failed to mention since we
were so focused on explaining extensions to DC metadata. That issue is
how the refinements that are already defined by DCMI will work.

In OLAC 0.4, we used a "refine" attribute for the names of refinements
defined in the Qualified DC recommendation. We made this up in the
absence of any recommendation from DCMI as to how this should be
implemented. If you look up the new DCMI documents referenced in the
main posting, you will see that they have now addressed this issue, and
their solution is to treat the refinements as tags in their own right,
but they are from the "dcterms" namespace, rather than the "dc"
namespace.

Thus, this from OLAC 0.4:

  Original title
  Translated title

would be the following in OLAC 1.0:

  Original title
  Translated title

N.B. Since our new solution is an application profile, the 15 main
metadata tags (like title in this example) are in the Dublin Core
namespace rather than our own.
In the examples that Steven has posted in the /OLAC/1.0b1/ directory,
the Dublin Core namespace is declared to be the default namespace, so
that the above is actually expressed as:

  Original title
  Translated title

Anyway, I thought I should point out this difference between OLAC 0.4
and the proposed 1.0 since it, too, will have an impact on your
implementation of data providers.

-Gary Simons
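
For implementers who want to see the namespace point in action, here is
a minimal Python sketch. It is an illustration under stated assumptions,
not the format definition: it assumes the refinement in the title
example is dcterms:alternative, and it wraps the elements in a
hypothetical <record> container rather than the real OLAC one. The two
namespace URIs are the standard Dublin Core ones. The sketch checks that
a title written with an explicit dc: prefix and a title written under a
default Dublin Core namespace resolve to the same qualified name, which
is what a harvester would end up comparing.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespace URIs.
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical record using an explicit "dc" prefix.
# The <record> wrapper is an assumption made for this sketch.
PREFIXED = f"""
<record xmlns:dc="{DC}" xmlns:dcterms="{DCTERMS}">
  <dc:title>Original title</dc:title>
  <dcterms:alternative>Translated title</dcterms:alternative>
</record>
"""

# The same two elements with Dublin Core declared as the default namespace.
DEFAULTED = f"""
<record xmlns="{DC}" xmlns:dcterms="{DCTERMS}">
  <title>Original title</title>
  <dcterms:alternative>Translated title</dcterms:alternative>
</record>
"""


def qualified_children(xml_text):
    """Return (namespace-qualified tag, text) for each child of the root."""
    root = ET.fromstring(xml_text.strip())
    return [(child.tag, child.text) for child in root]


# Both spellings yield identical qualified names for the child elements.
print(qualified_children(PREFIXED) == qualified_children(DEFAULTED))  # True
```

The practical consequence for a data provider is that either spelling
should be acceptable to a namespace-aware parser, so the choice of
prefix versus default namespace is a matter of convenience.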