possible new metrics for metadata quality

Baden Hughes badenh at CS.MU.OZ.AU
Wed Aug 4 03:19:42 UTC 2004


Gary Simons and I recently discussed a number of new and revised metrics
for metadata quality in response to the deployment of the archive report cards
for OLAC. Here I attempt to describe these, and seek comment on them from
the wider OLAC implementers community before adding them to the feature
list for a subsequent version of the archive report cards.

We started the discussion from the perspective that there are certain
features that OLAC values inherently, and then attempted to derive metrics
which measure how well an archive meets those values.

* Data Available Online
OLAC values the availability of language data in an online accessible
format. For the most part, we can compute a metric for this based on the
presence or absence of a URI for location. Thus any resource which is
available online should be scored higher than resources which are
only available offline.
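
As a rough illustration of the URI test (a sketch only, not what the report cards currently do), the check could look something like the following. The record layout (a dict mapping element names to lists of values), the "identifier" key and the set of schemes treated as "online" are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumption: only these URI schemes count as "available online".
ONLINE_SCHEMES = {"http", "https", "ftp"}


def is_online(record):
    """Return True if any identifier value looks like a resolvable online URI."""
    for value in record.get("identifier", []):
        parsed = urlparse(value.strip())
        # Require both a known scheme and a host part, so that values such as
        # "oai:example.org:1" or "ISBN 123" are not counted as online.
        if parsed.scheme.lower() in ONLINE_SCHEMES and parsed.netloc:
            return True
    return False


def online_score(records):
    """Proportion of records (0.0 to 1.0) with at least one online URI."""
    if not records:
        return 0.0
    return sum(1 for record in records if is_online(record)) / len(records)
```

A per-archive score is then just online_score(records) over all of that archive's records; whether offline resources should score zero or merely lower is left open here.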

* Use of OLAC Vocabularies
The controlled vocabularies distinguish OLAC from other Dublin Core
communities, and provide the value-added component of this metadata from a
linguistic perspective. Thus the effective use of OLAC vocabularies wherever
they apply should score highly on a quality metric.

* Scope of the Vocabulary
Controlled vocabularies which make fine-grained, linguistically intuitive
distinctions are more useful for information retrieval tasks than those
which provide broader but generic information. Thus the use of controlled
vocabularies which are strongly linguistically grounded should be scored
more highly than those which are general.

* Quality of an Archive's Metadata
It is possible to assess metadata on a per record basis and in aggregate
against a baseline optimum. Consistent use of metadata elements should be
scored more highly than randomised use. The overall distribution of
metadata can be scored based
on the proportion of metadata records which individually score highly.
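
To make the per-record and aggregate idea concrete, here is a minimal sketch. The baseline list of expected elements and the 0.8 threshold for "scores highly" are hypothetical values chosen for the example, not agreed figures, and records are again assumed to be dicts mapping element names to lists of values.

```python
# Hypothetical baseline: the elements an "optimum" record is expected to carry.
EXPECTED_ELEMENTS = ("title", "date", "subject", "description", "identifier")


def record_score(record, expected=EXPECTED_ELEMENTS):
    """Fraction of the expected elements that are present and non-empty."""
    present = sum(1 for name in expected if record.get(name))
    return present / len(expected)


def proportion_high_scoring(records, threshold=0.8):
    """Proportion of records whose individual score meets the threshold.

    The threshold of 0.8 is an assumption made for illustration only.
    """
    if not records:
        return 0.0
    high = sum(1 for record in records if record_score(record) >= threshold)
    return high / len(records)
```

This only measures completeness against the baseline; a measure of consistency across records would need something more than this.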

* Quality of the Archive's Collection
In addition to simple aggregation of the quality of metadata for all
records within an archive, there are other metrics which also assist with
the assessment of the quality of an archive's collection. For example, the
Breadth of Archive Coverage would be equal to the number of
languages represented within an archive divided by the number of languages
in the Ethnologue. Similarly, the Depth of Archive Coverage would be equal
to the number of records in the archive divided by the number of languages
represented within that archive.
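
The two coverage ratios can be sketched as below. The "language" key holding a list of language codes per record is an assumption about the record layout, and the Ethnologue language count is passed in as a parameter rather than hard-coded.

```python
def archive_coverage(records, ethnologue_language_count):
    """Return (breadth, depth) of coverage for one archive.

    breadth = distinct languages in the archive / languages in the Ethnologue
    depth   = records in the archive / distinct languages in the archive
    """
    languages = set()
    for record in records:
        # Assumption: each record lists its language codes under "language".
        languages.update(record.get("language", []))

    breadth = len(languages) / ethnologue_language_count
    # Avoid dividing by zero for an archive with no language-tagged records.
    depth = len(records) / len(languages) if languages else 0.0
    return breadth, depth
```

Note that records with no language at all still count towards depth as written, which may or may not be what we want.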

In addition to these values, simple surveying techniques such as counting
the usage of various elements are of value to archive maintainers who
seek to improve the quality of their metadata, and in turn, to the wider
community which uses the collection of archives as a starting point for
searching for language data.
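
A simple element usage survey might be computed along these lines. This is a sketch which assumes each record is a dict mapping element names to lists of values; it counts how many records use each element at least once.

```python
from collections import Counter


def element_usage(records):
    """Count, for each element name, the number of records that use it."""
    usage = Counter()
    for record in records:
        for name, values in record.items():
            if values:  # ignore elements that are present but empty
                usage[name] += 1
    return usage
```

Calling element_usage(records).most_common() then lists the elements from most to least used, which is probably the form most useful to an archive maintainer.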

Additional metrics which have occurred to us recently and could also be
considered include the following:

- use of best practice or enduring data formats (eg XML vs DOC)
- collection of related materials and types (eg dictionaries, grammars,
recordings for the same language)

The comments of the list membership are, as always, most welcome.

Regards

Baden


