updated OLAC Archive Report Cards

Baden Hughes badenh at CS.MU.OZ.AU
Wed Aug 4 03:00:11 UTC 2004


First, apologies for the delay in replying to the list. Gary and I met up
at a recent EMELD meeting and discussed this posting and related issues at
some length; I'm extracting the relevant parts of our conversation here as
a response to the posting itself.

A number of Gary's points have been addressed through the posting of
revised and consistent documentation, which is now available from each
archive report.

> I'm looking at the documentation page and find that the explanation in "2.
> Star Rating" isn't enough information to make it clear.  That says it is
> based on the "average item score out of 10".  I'm not sure what that means.
> A natural conclusion would be that each of the remaining outline points in
> the document is an item, and the overall rating is based on the average of
> those.  But I don't think that is what it means, since those don't describe
> things scored on the basis of 0 to 10.
> 10-point scoring seems to appear only in "4. Metadata Quality", and that
> section does talk about "items", so is it the case that the star rating
> deals only with point 4 on metadata quality?  If so, the discussion in
> point 2 should make this explicit.

Gary's observations are largely correct. The archive reports actually
contain several different types of information which we believe indicate
metadata quality. The star-based rating system is a very coarse metric
resulting from various documented combinations of other parameters, and is
simply reported in section 2.

In this particular context, the metric is based on the distribution of
scores across the entire set of records within an archive. Calculations are
performed against a hypothetical maximum of 10 points per record, and the
result is then reduced to a score out of 5 for presentation purposes.
Rounding is always upward.
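
To make that reduction concrete, here's a minimal sketch in Python (the
function name is mine, not the report-generation code, and I've assumed
whole-star rounding; the published documentation governs the actual
granularity):

    import math

    def star_rating(record_scores):
        """Reduce per-record scores (each out of 10) to a star rating out of 5.

        record_scores -- per-record metadata quality scores in the range 0..10
        """
        if not record_scores:
            return 0
        average = sum(record_scores) / float(len(record_scores))  # average item score out of 10
        # Reduce to a 5-point scale; ceiling to whole stars is an assumption
        # here, per "rounding is always upward".
        return int(math.ceil(average / 2.0))

    print(star_rating([8.0, 9.0, 7.0]))   # average 8.0 out of 10 -> 4 stars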

> If I'm on the right track that the items that are averaged are just those
> in point 4, I'm still not completely clear on what constitutes an "item".
> ... Okay, as I look back and forth some more, I'm developing a new
> hypothesis, namely, that "item" refers to the record returned for one
> archive holding.  (Up to this point I was thinking it meant, "one of the
> things on our checklist of quality points to check for".)

The documentation has been updated to reflect the situation more accurately. The
basic unit handled in these reports is the metadata record for a single
item.

> So, does that
> mean, each harvested record from the archive is given a quality score from
> 0 to 10, and the stars are based on the average for all records?  That is
> starting to sound more likely.

This is exactly the case.

> In that case, it still seems like the stars are based only on "4. Metadata
> quality".

Again, this is exactly the case. The star-based ranking system is derived
directly from the Metadata Quality measures in section 4.

> Now I think I'm understanding what the quality metric is, but I
> want to make sure.  The first part of it is:
>
> Code exists score = ( Number of elements containing code attributes )
>                    / ( Number of elements in the record of type with associated code )
>
> Does this mean:
>
> Code exists score = ( Number of elements containing a code attribute )
>                    / ( Number of elements in the record that could have a code attribute )
>
> If so, then we could explore some actual cases and ask if we are getting a
> reasonable answer.  For instance, if there were a record that contained
> only one element and that was a <subject> element with a code from
> OLAC-Language, would that mean a "code exists score" of 1.0?  It would be
> missing 4 out of 5 core elements for a deduction of 10 * (1/5)(.8) = 1.6,
> yielding a total score of 8.4.  If the archive contained thousands of
> records, all of which had only a coded subject element, then the average
> item score would be 8.4 for an overall rating of four stars.  Have I
> understood the formulas correctly?

Yes, your working here is correct.
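
To restate the confirmed reading as a sketch (the record representation and
function name here are purely illustrative, not the actual report code):

    def code_exists_score(record):
        """Code exists score for a single metadata record.

        record -- list of (element_name, has_code) pairs, one entry per
                  element in the record that could carry a code attribute
        """
        if not record:
            return 0.0
        coded = sum(1 for name, has_code in record if has_code)
        return coded / float(len(record))

    # Gary's example: a record whose only element is a <subject> coded from
    # OLAC-Language scores 1.0 on "code exists"; missing 4 of the 5 core
    # elements then deducts 1.6 from the 10-point maximum, giving 8.4.
    print(code_exists_score([('subject', True)]))   # 1.0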

> If so, then I think we will need to do
> more work on figuring out the best way to compute the overall score.  In
> this case a score that multiplies the percentage of core elements by the
> percentage of code exists would yield 2 out of 10 which sounds like a more
> appropriate score.

That's precisely the discussion I'd like to have in this forum. Consider
the archive quality metrics a first cut; they can surely be refined to
align better with the values of the community as a whole.

> A fine point on "code exists" is what it does with non-OLAC refinements.
> For instance, if a record contained only two elements and they were <type>
> elements, one with the OLAC Linguistic Type and the other with the DCMI
> Type, would that score as 1.0 or 0.5 on the code exists scale?  It looks to
> me like it would be 0.5, which is half as good as the score of a record
> consisting of only one coded <type> element, but in fact, the record with
> two <type> elements is a better quality record.

Yes, in the case above there'd be a score of 0.5. It's these types of
weaknesses in the algorithms that we'd like to address.
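
In terms of the sketch above, the current algorithm simply sees the
DCMI-refined element as a codeable element without an OLAC code, which is
where the 0.5 comes from (again an illustrative representation, not the
report code):

    # Two <type> elements: one carrying an OLAC Linguistic Type code, one
    # refined only by the DCMI Type vocabulary (and so counted as uncoded
    # by the current algorithm).
    record = [('type', True),     # OLAC Linguistic Type code present
              ('type', False)]    # DCMI Type refinement, no OLAC code
    coded = sum(1 for name, has_olac_code in record if has_olac_code)
    print(coded / float(len(record)))   # 0.5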

There's a more general issue here, and that is how we value DC as opposed
to OLAC metadata. The current situation is that there's actually little to
distinguish OLAC from a general DC community, since the specific
extensions promoted by OLAC are only lightly used.

> The metric for "3. Archive Diversity" needs more thought.  It is defined
> as:
>
> Diversity = (Distinct code values / Number instances of element) * 100
>
> The scores for diversity with respect to the Linguistic Type code
> illustrate the problem well.  An archive containing only one record which
> is coded for one linguistic type would score 100%.  Whereas an archive
> containing 1,000 records, all of which have a type element for the same
> code would score 0.1%--but the one archive is not 1000 times as diverse as
> the other.

Absolutely, it's a known weakness.
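
To spell the weakness out with Gary's numbers (illustrative code, but the
formula is the documented one):

    def diversity(distinct_codes, element_instances):
        """Current metric: distinct code values as a percentage of element instances."""
        return (distinct_codes / float(element_instances)) * 100

    print(diversity(1, 1))      # one record, one Linguistic Type code   -> 100.0
    print(diversity(1, 1000))   # 1000 records, all with the same code   -> 0.1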

> I'm wondering if the formula shouldn't be:
>
> Diversity = (Distinct code values / Total codes in vocabulary) * 100
>
> Then every archive that has at least one instance of all three possible
> values of Linguistic Type (regardless of how many records it has) would be
> maximally diverse with respect to linguistic type.  I think that sounds
> more correct.
>
> Rather than Diversity, I wonder if the concepts of Breadth and Depth would
> serve better.  That is, the Breadth of an archive (with respect to a
> controlled vocabulary) would be the percentage of possible values that it
> has.  It's Depth (with respect to that vocabulary) would be the average
> number of records it has for each used code value.

I think that is an excellent proposition. Diversity was simply an attempt
to quantify this dimension, and perhaps additional metrics are warranted
[see the note on new metrics at the end of this message].
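
Here's a sketch of Breadth and Depth along the lines Gary suggests (the
function names and dictionary representation are mine, with the three
Linguistic Type values standing in as an example vocabulary):

    def breadth(code_counts, vocabulary):
        """Percentage of a controlled vocabulary's values used at least once.

        code_counts -- dict mapping code value -> number of records using it
        vocabulary  -- list of all possible values in the vocabulary
        """
        used = [value for value in vocabulary if code_counts.get(value, 0) > 0]
        return 100.0 * len(used) / len(vocabulary)

    def depth(code_counts):
        """Average number of records per code value actually used."""
        used = [count for count in code_counts.values() if count > 0]
        if not used:
            return 0.0
        return sum(used) / float(len(used))

    # An archive of 1000 records, all carrying the same Linguistic Type code,
    # comes out narrow but deep rather than "0.1% diverse":
    vocabulary = ['primary_text', 'lexicon', 'language_description']
    print(breadth({'primary_text': 1000}, vocabulary))   # 33.3...
    print(depth({'primary_text': 1000}))                 # 1000.0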

> On "7. Code usage", elements that may take a code are in focus.  I think it
> should be the code sets (i.e. the extensions) themselves.  We presently
> define five extensions, but some can occur with more than one element.  I
> think there are a total of 7 element-extension combinations.  I think it is
> those that should be analyzed here, not just the elements.  For instance,
> <subject> can occur with Language and Linguistic Field.  Those should be
> calculated as two separate entries in the chart.

That's an excellent suggestion. I'll add it to the feature list.
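
In implementation terms this just means tallying usage per (element,
extension) pair rather than per element; a sketch follows (the sample pairs
are illustrative, not the definitive list of seven combinations):

    def code_usage_by_combination(records):
        """Tally coded occurrences per (element, extension) combination.

        records -- iterable of records, each a list of (element, extension)
                   pairs; extension is the OLAC extension named by the code,
                   or None where the element carries no OLAC code.
        """
        usage = {}
        for record in records:
            for element, extension in record:
                if extension is not None:
                    key = (element, extension)
                    usage[key] = usage.get(key, 0) + 1
        return usage

    # <subject> coded from Language and <subject> coded from Linguistic Field
    # show up as two separate rows in the report:
    sample = [[('subject', 'Language'), ('subject', 'Linguistic Field')],
              [('subject', 'Language'), ('type', None)]]
    print(code_usage_by_combination(sample))
    # {('subject', 'Language'): 2, ('subject', 'Linguistic Field'): 1}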

Recent discussions have also proposed a number of new metrics and
alternatives to existing ones, and I'll post them here shortly for
comment.

Regards

Baden


