query about format.sourcecode

Mon Sep 16 22:13:15 UTC 2002

Baden Hughes <baden at compuling.net> wrote:
> I've got a query about matters related to the element format.sourcecode

It's good to see discussion of software resources for a change, and I hope
the maintainers of software archives (DFKI, TRACTOR) will contribute to
this discussion.

> Currently the spec at http://www.language-archives.org/OLAC/olacms.html
> assumes that software resources indexed by OLAC will be in source code
> (and hence appropriate entries will be made under this tagset).

Not quite - all OLAC elements are optional, and some elements are simply
inappropriate for some resources.  Software distributed in binary form only
doesn't need to be given any sourcecode descriptor.

> The recommendation is currently:
>
> <format.sourcecode
> code="PROGRAMMING_LANGUAGE">Comments</format.sourcecode>
>
> There are several questions I have about this.
>
> 1) Do we need to clarify this even further, as there are apparently two
> distinct options (from the archive contents I've been working with). One
> is where the sourcecode requires compilation, the other is where
> sourcecode is essentially a script (or series of scripts). Any
> information about the "state" of the source code is likely to be
> inconsistent at best across archives, and I suspect even within a single
> archive. IMHO it's relatively important to the end user of the OLAC
> search engine as to what state the sourcecode is in (ie how applicable
> is this code to the platforms I have access to).

Good, so the end-user requirement here is to be able to answer the
question: "Can I run this software?"

> 2) In the case where software resources indexed by OLAC are distributed
> in compiled form (ie not sourcecode) there's apparently not much more
> room to encode this information either. Apart from not strictly being
> something which belongs in a format.sourcecode element, the
> recommendation I assume would be that you could standardise this again
> by using the comment field, but the same consistency problem arises.
> Again, IMHO it's relatively important to the end user of the OLAC search
> engine as to what state the sourcecode is in (ie can I just install and
> run or is it more complex)

Right, so the end-user requirement here is to be able to answer the
question: "How much effort will be required to get this running?"

> These two points may not represent large issues, but if the archives you
> are dealing with have a lot of software which ranges from source scripts
> in a range of languages, source for compilation for a range of
> compilers, and compiled "ready to run" applications, the granularity of
> this markup can be important and greatly assist with classification and
> indexation of resources in an appropriate manner. Additionally, for the
> less computer literate end users, this distinction is very important in
> them effectively locating a resource which is appropriate to their
> needs.

Absolutely.  Currently we have vocabularies for Sourcecode, CPU, and OS.
However, we can modify or scrap them if they don't serve our needs for
resource description and discovery.  Perhaps we need a new vocabulary
that better describes the state of the sourcecode.

One way to proceed here is for Baden (and any others) to identify the full
range of end-user requirements (is it more than these two?) then propose
vocabularies that best serve these requirements...

-Steven

--
Steven.Bird at ldc.upenn.edu  http://www.ldc.upenn.edu/sb
Assoc Director, LDC; Adj Assoc Prof, CIS & Linguistics
Linguistic Data Consortium, University of Pennsylvania
3600 Market St, Suite 810, Philadelphia, PA 19104-2653