From badenh at CS.MU.OZ.AU Wed Aug 4 03:00:11 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Wed, 4 Aug 2004 13:00:11 +1000
Subject: updated OLAC Archive Report Cards
In-Reply-To:
Message-ID:

First, apologies for the delay in replying to the list. Gary and I met at a recent EMELD meeting and discussed this posting and related issues at some length; I'm extracting the relevant parts of our conversation here as a response to the posting itself. A number of Gary's points have been addressed through the posting of revised and consistent documentation, which is now available from each archive report.

> I'm looking at the documentation page and find that the explanation in "2. Star Rating" isn't enough information to make it clear. That says it is based on the "average item score out of 10". I'm not sure what that means. A natural conclusion would be that each of the remaining outline points in the document is an item, and the overall rating is based on the average of those. But I don't think that is what it means, since those don't describe things scored on the basis of 0 to 10.
> 10 point scoring seems to appear only in "4. Metadata Quality", and that section does talk about "items", so is it the case that the star rating deals only with point 4 on metadata quality? If so, the discussion in point 2 should make this explicit.

Gary's observations are largely correct. The archive reports actually contain several different types of information which we believe indicate metadata quality. The star-based rating system is a very coarse metric resulting from the documented combination of other parameters, and is simply reported in section 2. In this particular context, the metric summarises the score distribution across the entire set of records within an archive. Calculations are performed against a hypothetical maximum of 10 points per record, and then simply reduced to a score out of 5 for representational purposes. Rounding is always upward.

> If I'm on the right track that the items that are averaged are just those in point 4, I'm still not completely clear on what constitutes an "item". ... Okay, as I look back and forth some more, I'm developing a new hypothesis, namely, that "item" refers to the record returned for one archive holding. (Up to this point I was thinking it meant, "one of the things on our checklist of quality points to check for".)

The documentation has been updated to reflect the situation more accurately. The basic unit handled in these reports is the metadata record for a single item.

> So, does that mean, each harvested record from the archive is given a quality score from 0 to 10, and the stars are based on the average for all records? That is starting to sound more likely.

This is exactly the case.

> In that case, it still seems like the stars are based only on "4. Metadata quality".

Again, this is exactly the case. The star-based ranking system is directly related to Metadata Quality in #4.
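To make the relationship between record scores and stars concrete, here is a minimal sketch in Python. It is an illustration only, not the report card implementation; the function name is mine, and the rounding granularity is an assumption read literally from the description above.

    import math

    def star_rating(record_scores):
        """Collapse per-record quality scores (each against a hypothetical
        maximum of 10) into a star rating out of 5.

        Illustrative sketch only: it takes the plain average of the record
        scores, rescales it from 10 to 5, and applies the "rounding is
        always upward" rule at whole-star granularity. The granularity of
        the rounding in the deployed reports is not specified above.
        """
        if not record_scores:
            return 0
        average = sum(record_scores) / len(record_scores)  # average item score out of 10
        return math.ceil(average / 2)                      # reduce to a score out of 5, rounding upward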
> Now I think I'm understanding what the quality metric is, but I want to make sure. The first part of it is:
>
> Code exists score = (Number of elements containing code attributes) / (Number of elements in the record of a type with an associated code)
>
> Does this mean:
>
> Code exists score = (Number of elements containing a code attribute) / (Number of elements in the record that could have a code attribute)
>
> If so, then we could explore some actual cases and ask if we are getting a reasonable answer. For instance, if there were a record that contained only one element and that was a <subject> element with a code from OLAC-Language, would that mean a "code exists score" of 1.0? It would be missing 4 out of 5 core elements for a deduction of 10 * (1/5)(.8) = 1.6, yielding a total score of 8.4. If the archive contained thousands of records, all of which had only a coded subject element, then the average item score would be 8.4 for an overall rating of four stars. Have I understood the formulas correctly?

Yes, your working here is correct.

> If so, then I think we will need to do more work on figuring out the best way to compute the overall score. In this case a score that multiplies the percentage of core elements by the percentage of code exists would yield 2 out of 10, which sounds like a more appropriate score.

That's precisely the discussion I'd like to have in this forum. Consider the archive quality metrics a first cut; they can surely be refined to be better aligned with the values of the community as a whole.
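For anyone who wants to replay Gary's worked example, here is the arithmetic as a small Python sketch. The variable names, and the way the code exists score and the core-element deduction are combined, are my own reconstruction from the numbers quoted above, not the deployed scoring code.

    # Replaying the numbers in the example above (values as quoted; the 0.8
    # factor is taken verbatim from the deduction and not derived here).

    code_exists_score = 1 / 1              # 1 coded element / 1 element that could carry a code
    deduction = 10 * (1 / 5) * 0.8         # "10 * (1/5)(.8) = 1.6" for missing 4 of 5 core elements
    current_score = 10 * code_exists_score - deduction   # 10 - 1.6 = 8.4, the "total score" in the example

    # Gary's proposed alternative: multiply core-element coverage by the
    # code exists score, then scale to 10.
    proposed_score = 10 * (1 / 5) * code_exists_score    # 10 * (1/5) * 1.0 = 2 out of 10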
> A fine point on "code exists" is what it does with non-OLAC refinements. For instance, if a record contained only two elements and they were <type> elements, one with the OLAC Linguistic Type and the other with the DCMI Type, would that score as 1.0 or 0.5 on the code exists scale? It looks to me like it would be 0.5, which is half as good as the score of a record consisting of only one coded element, but in fact, the record with two elements is a better quality record.

Yes, in the case above there'd be a score of 0.5. It's these types of weaknesses in the algorithms that we'd like to address. There's a more general issue here, and that is how we value DC as opposed to OLAC metadata. The current situation is that there's actually little to distinguish OLAC from a general DC community, since the specific extensions promoted by OLAC are only lightly used.

> The metric for "3. Archive Diversity" needs more thought. It is defined as:
>
> Diversity = (Distinct code values / Number of instances of element) * 100
>
> The scores for diversity with respect to the Linguistic Type code illustrate the problem well. An archive containing only one record which is coded for one linguistic type would score 100%. Whereas an archive containing 1,000 records, all of which have a type element for the same code, would score 0.1% -- but the one archive is not 1000 times as diverse as the other.

Absolutely, it's a known weakness.

> I'm wondering if the formula shouldn't be:
>
> Diversity = (Distinct code values / Total codes in vocabulary) * 100
>
> Then every archive that has at least one instance of all three possible values of Linguistic Type (regardless of how many records it has) would be maximally diverse with respect to linguistic type. I think that sounds more correct.
>
> Rather than Diversity, I wonder if the concepts of Breadth and Depth would serve better. That is, the Breadth of an archive (with respect to a controlled vocabulary) would be the percentage of possible values that it has. Its Depth (with respect to that vocabulary) would be the average number of records it has for each used code value.

I think that is an excellent proposition. Diversity was simply an attempt to quantify this dimension, but perhaps more metrics are warranted [see the following discussion about new metrics].
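To make the contrast concrete, here is a small Python sketch of the current Diversity formula, the revised formula Gary proposes, and the Breadth/Depth pair. The function names are illustrative choices of my own, not anything in the report card code.

    def diversity_current(distinct_codes_used, element_instances):
        """Current metric: distinct code values over the number of instances
        of the element, as a percentage."""
        return 100 * distinct_codes_used / element_instances

    def diversity_proposed(distinct_codes_used, vocabulary_size):
        """Gary's revision: distinct code values over the total number of
        codes in the vocabulary, as a percentage."""
        return 100 * distinct_codes_used / vocabulary_size

    def breadth(distinct_codes_used, vocabulary_size):
        """Breadth with respect to a vocabulary: percentage of its possible
        values that the archive uses at all."""
        return 100 * distinct_codes_used / vocabulary_size

    def depth(records_using_vocabulary, distinct_codes_used):
        """Depth with respect to a vocabulary: average number of records
        per code value actually used."""
        return records_using_vocabulary / distinct_codes_used

    # e.g. 1,000 records, all coded with the same one of the three
    # Linguistic Type values:
    print(diversity_current(1, 1000))     # 0.1
    print(diversity_proposed(1, 3))       # 33.3...
    print(breadth(1, 3), depth(1000, 1))  # 33.3... 1000.0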
> On "7. Code usage", elements that may take a code are in focus. I think it should be the code sets (i.e. the extensions) themselves. We presently define five extensions, but some can occur with more than one element. I think there are a total of 7 element-extension combinations. I think it is those that should be analyzed here, not just the elements. For instance, <subject> can occur with Language and Linguistic Field. Those should be calculated as two separate entries in the chart.

That's an excellent suggestion. I'll add it to the feature list.

Recent discussions have also proposed a number of new metrics and alternatives to existing ones, and I'll post them here shortly for comment.

Regards

Baden

From badenh at CS.MU.OZ.AU Wed Aug 4 03:19:42 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Wed, 4 Aug 2004 13:19:42 +1000
Subject: possible new metrics for metadata quality
In-Reply-To:
Message-ID:

Gary Simons and I recently discussed a number of new and revised metrics for metadata quality in response to the deployment of the archive report cards for OLAC. Here I attempt to describe these, and seek comment on them from the wider OLAC implementers community before adding them to the feature list for a subsequent version of the archive report cards.

We started the discussion from the perspective that there are certain features that OLAC values inherently, and then attempted to derive metrics which measure how well an archive meets those values.

* Data Available Online

OLAC values the availability of language data in an online accessible format. For the most part, we can compute a metric for this based on the presence or absence of a URI for the location. Thus any resource which is available online should be scored higher than resources which are only available offline.

* Use of OLAC Vocabularies

The controlled vocabularies distinguish OLAC from other Dublin Core communities, and provide the value-added component of this metadata from a linguistic perspective. Thus the efficient use of OLAC vocabularies in all possible circumstances should score highly on a quality metric.

* Scope of the Vocabulary

Controlled vocabularies which make fine-grained, linguistically intuitive distinctions are more useful for information retrieval tasks than those which provide broader but generic information. Thus the use of controlled vocabularies which are strongly linguistically grounded should be scored more highly than those which are general.

* Quality of an Archive's Metadata

It is possible to assess metadata on a per-record basis and in aggregate against a baseline optimum. Consistent use of metadata elements should be scored more highly than randomised use. The overall distribution of metadata can be scored based on the proportion of metadata records which individually score highly.

* Quality of the Archive's Collection

In addition to simple aggregation of the quality of metadata for all records within an archive, there are other metrics which also assist with the assessment of the quality of an archive's collection. Thus the Breadth of Archive Coverage would be equal to the number of languages represented within an archive divided by the number of languages in the Ethnologue. Similarly, the Depth of Archive Coverage would be equal to the number of records in the archive divided by the number of languages represented within that archive.

In addition to these values, simple surveying techniques such as counting the usage of various elements are of value to archive maintainers who seek to improve the quality of their metadata, and in turn, to the wider community who uses the collection of archives as a starting point for searching for language data.

Additional metrics which have been suggested recently and could also be considered include the following:

- use of best practice or enduring data formats (e.g. XML vs DOC)
- collection of related materials and types (e.g. dictionaries, grammars, recordings for the same language)

The comments of the list membership are, as always, most welcome.

Regards

Baden