From thien at UNIMELB.EDU.AU Mon Jun 21 01:30:17 2004
From: thien at UNIMELB.EDU.AU (Nicholas Thieberger)
Date: Mon, 21 Jun 2004 11:30:17 +1000
Subject: OLAC Metadata
Message-ID:

Query from Linda Barwick following a discussion on the 'report card'
for our metadata set.

>1. creator - why do they score on this under 'Code Usage' when DC
>recommendation is to use contributor?
>"Dublin Core now discourages the use of the Creator element,
>recommending that all Role information be associated with
>Contributor elements." http://www.language-archives.org/REC/role.html

From badenh at CS.MU.OZ.AU Mon Jun 21 02:51:23 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Mon, 21 Jun 2004 12:51:23 +1000
Subject: OLAC Metadata
In-Reply-To:
Message-ID:

It's nice to hear that someone's looking at the reports :-)

The Dublin Core recommendation is now that Role attributes be associated
with Contributor; however, the development of the OLAC controlled
vocabulary pre-dated the DC recommendation. In fact, the OLAC Role
Vocabulary describes attributes which may be referential to both Creator
and Contributor, and these elements exist independently within OLAC
records.

For the reference of others on this list, I'm referring here to:

http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=79

As to why creator is scored under 'Code Usage': in PARADISEC's case,
where only Contributor is present, the Creator element is in fact
redundant (hence the 0% score). In other words, in the context of this
particular archive, the OLAC Role vocabulary is only applicable to
Contributor. In other archives, however, both Creator and Contributor
are used, both with and without the OLAC Role Vocabulary. While it could
be argued that these latter archives are not following "best practice"
as recommended by Dublin Core, it's interesting to note that overall:

http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all

Creator is used more times than Contributor, but the OLAC Role vocabulary
is much more widely used for Contributor instances.

This type of observation is precisely why the Archive Report Card was
developed: to allow archive maintainers to see where best to expend
effort in metadata quality improvement.

Regards

Baden

On Mon, 21 Jun 2004, Nicholas Thieberger wrote:

> Query from Linda Barwick following a discussion on the 'report card'
> for our metadata set.
>
> >1. creator - why do they score on this under 'Code Usage' when DC
> >recommendation is to use contributor?
> >"Dublin Core now discourages the use of the Creator element,
> >recommending that all Role information be associated with
> >Contributor elements." http://www.language-archives.org/REC/role.html
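[To make the 'Code Usage' figure discussed above concrete, here is a
minimal sketch of a per-element tally. The record structure and function
name are illustrative assumptions, not the actual report-card code.]

```python
# Sketch of a per-element 'Code Usage' tally. Records are assumed to be
# parsed into (element, carries_olac_code) pairs; this structure is an
# illustration, not the report card's real data model.

def role_code_usage(records, element):
    """Percentage of `element` instances carrying an OLAC Role code."""
    instances = [has_code
                 for rec in records
                 for (name, has_code) in rec
                 if name == element]
    if not instances:
        return 0.0  # element never appears, as with creator at PARADISEC
    return 100.0 * sum(instances) / len(instances)

# A PARADISEC-like archive: roles are coded on contributor only, and
# creator never appears, so creator scores 0%.
records = [
    [("contributor", True), ("title", False)],
    [("contributor", True), ("contributor", False)],
]
print(role_code_usage(records, "contributor"))  # 66.66... (2 of 3 coded)
print(role_code_usage(records, "creator"))      # 0.0
```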
From badenh at CS.MU.OZ.AU Mon Jun 28 00:14:15 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Mon, 28 Jun 2004 10:14:15 +1000
Subject: updated OLAC Archive Report Cards
Message-ID:

Dear OLAC Implementers,

You may recall that in March, we announced a new service which had
recently been added to the OLAC site, namely archive report cards. These
give summary statistics for each repository and an assessment of the
quality of the repository's metadata against both external best practice
recommendations and the relative use practices within the OLAC context.

An updated version of the archive report cards is now available. Changes
include:

- an updated evaluation algorithm to account for changes in DC
  recommendations (e.g. use of contributor rather than creator)
- updated labelling of graphs to be more consistent with OLAC terminology

The report cards can be accessed by clicking the "REPORT CARD" links on
the OLAC Archives page [1]. The report is also available for the full
set of repositories [2]. Information about how these reports are
generated is also available [3]. Reports are updated after every
harvest, currently every 12 hours.

The evaluation metric rewards the use of OLAC extensions (controlled
vocabularies) and of what we consider to be core DC elements: title,
date, subject, description, and identifier.

The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
at the University of Melbourne, with sponsorship from the Department of
Computer Science and Software Engineering. We welcome your feedback.

Regards

Baden Hughes

[1] http://www.language-archives.org/archives.php4
[2] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all
[3] http://www.language-archives.org/tools/reports/ExplainReport.html
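[As an illustration of the core-element side of the metric described
above, a minimal sketch. The equal weighting of the five core elements
is an assumption; the documentation at [3] is authoritative.]

```python
# Sketch of the core-DC-element part of the evaluation metric described
# above. Equal weighting of the five core elements is an assumption.

CORE_ELEMENTS = {"title", "date", "subject", "description", "identifier"}

def core_element_score(record_elements):
    """Fraction of the five core DC elements present in a record."""
    present = CORE_ELEMENTS & set(record_elements)
    return len(present) / len(CORE_ELEMENTS)

# A record with only a title and an identifier covers 2 of 5 core elements.
print(core_element_score(["title", "identifier", "format"]))  # 0.4
```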
From badenh at CS.MU.OZ.AU Mon Jun 28 00:27:28 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Mon, 28 Jun 2004 10:27:28 +1000
Subject: new OLAC-oriented search engine
Message-ID:

Dear OLAC Implementers,

A new service has recently been added to the LDC site, namely an OLAC
Search Engine [1].

This search engine complements other OLAC search engines, including
those already deployed at LinguistList [2]. This instantiation takes an
entirely different approach, and aims to be very similar to
keyword-based web search engines.

Other features of the search engine include:

- search by alternate names using the Ethnologue
- search by language code
- keyword-in-context highlighting in search results
- search for similar spellings
- exact/approximate/partial string matching
- search operators (AND, OR, NOT, +, -)
- support for inline syntax (e.g. creator:hale)
- search for related items on Google

The search engine results are ranked according to the quality of the
metadata on a per-record and per-archive basis. These rankings are
derived from the same underlying algorithm used in the OLAC Archive
Reports [3].

The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
at the University of Melbourne, with sponsorship from the Department of
Computer Science and Software Engineering. We welcome your feedback.

Regards

Baden Hughes

[1] http://wave.ldc.upenn.edu/olac/search.php
[2] http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/olac/olac-search-advanced.cfm
[3] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all
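[As a rough illustration of the inline field syntax mentioned above
(e.g. creator:hale), here is a sketch of how such queries might be split
into fielded and free-text terms. The grammar and field list are assumed
from the announcement, not taken from the actual LDC implementation.]

```python
import re

# Sketch of parsing inline field syntax such as "creator:hale".
# The token grammar (field:value mixed with free-text keywords) and the
# set of recognised fields are assumptions for illustration only.

FIELDS = {"creator", "contributor", "title", "subject", "language"}

def parse_query(query):
    """Split a query string into fielded terms and free-text keywords."""
    fielded, keywords = {}, []
    for token in query.split():
        m = re.match(r"(\w+):(.+)", token)
        if m and m.group(1) in FIELDS:
            fielded.setdefault(m.group(1), []).append(m.group(2))
        else:
            keywords.append(token)
    return fielded, keywords

print(parse_query("creator:hale warlpiri texts"))
# ({'creator': ['hale']}, ['warlpiri', 'texts'])
```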
From hdry at LINGUISTLIST.ORG Mon Jun 28 13:08:49 2004
From: hdry at LINGUISTLIST.ORG (Helen Aristar-Dry)
Date: Mon, 28 Jun 2004 09:08:49 -0400
Subject: new OLAC-oriented search engine
In-Reply-To:
Message-ID:

Hi, Baden,

The LDC link doesn't seem to be working.

-Helen

Quoting Baden Hughes:

> Dear OLAC Implementers,
>
> A new service has recently been added to the LDC site, namely an OLAC
> Search Engine [1].
>
> This search engine complements other OLAC search engines, including
> those already deployed at LinguistList [2]. This instantiation takes
> an entirely different approach, and aims to be very similar to
> keyword-based web search engines.
>
> Other features of the search engine include:
>
> - search by alternate names using the Ethnologue
> - search by language code
> - keyword-in-context highlighting in search results
> - search for similar spellings
> - exact/approximate/partial string matching
> - search operators (AND, OR, NOT, +, -)
> - support for inline syntax (e.g. creator:hale)
> - search for related items on Google
>
> The search engine results are ranked according to the quality of the
> metadata on a per-record and per-archive basis. These rankings are
> derived from the same underlying algorithm used in the OLAC Archive
> Reports [3].
>
> The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
> at the University of Melbourne, with sponsorship from the Department
> of Computer Science and Software Engineering. We welcome your feedback.
>
> Regards
>
> Baden Hughes
>
> [1] http://wave.ldc.upenn.edu/olac/search.php
> [2] http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/olac/olac-search-advanced.cfm
> [3] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all

-Helen

Helen Aristar-Dry
Prof. of Linguistics
Eastern Michigan University
Ypsilanti, MI 48103
hdry at linguistlist.org
734.487.0144 (office)
734.741.1567 (home)

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From haejoong at LDC.UPENN.EDU Mon Jun 28 14:26:55 2004
From: haejoong at LDC.UPENN.EDU (Haejoong Lee)
Date: Mon, 28 Jun 2004 10:26:55 -0400
Subject: new OLAC-oriented search engine
In-Reply-To: <1088428129.40e018614fd68@webmail.linguistlist.org>
Message-ID:

On Mon, Jun 28, 2004 at 09:08:49AM -0400, Helen Aristar-Dry wrote:
> Hi, Baden,
> The ldc link doesn't seem to be working.

Hi Helen,

LDC is experiencing intermittent network connectivity with locations
outside of the Penn network. I hope this can be resolved soon.

-Haejoong

From badenh at CS.MU.OZ.AU Tue Jun 29 11:47:46 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Tue, 29 Jun 2004 21:47:46 +1000
Subject: correction: new OLAC-oriented search engine URL
Message-ID:

The URL for the OLAC-oriented search engine was incorrectly specified
earlier. The correct URL is:

http://www.ldc.upenn.edu/olac/search.php

Apologies for any inconvenience.

Regards

Baden

From Gary_Simons at SIL.ORG Wed Jun 30 02:25:08 2004
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Tue, 29 Jun 2004 21:25:08 -0500
Subject: updated OLAC Archive Report Cards
In-Reply-To:
Message-ID:

Baden,

Thanks for all your work on this report card system. I think it is a
great idea and trust that it will serve to help all of us improve the
quality of the metadata we are publishing. I'm the implementer for two
archives, one of which ends up with a five-star rating, while the other
ends up with just two. My intuition is that they are not that different
in quality, so I'm trying to understand the scoring system to see what
accounts for the huge difference. I've also reviewed our archive report
card with Joan Spanne, our archivist, to see what feedback she might
have. She is actually responsible for many of the observations in this
note.

I'm looking at the documentation page and find that the explanation in
"2. Star Rating" doesn't give enough information to make it clear. It
says the rating is based on the "average item score out of 10". I'm not
sure what that means. A natural conclusion would be that each of the
remaining outline points in the document is an item, and the overall
rating is based on the average of those. But I don't think that is what
it means, since those don't describe things scored on a scale of 0 to
10. Ten-point scoring seems to appear only in "4. Metadata Quality", and
that section does talk about "items", so is it the case that the star
rating deals only with point 4 on metadata quality? If so, the
discussion in point 2 should make this explicit.

If I'm on the right track that the items that are averaged are just
those in point 4, I'm still not completely clear on what constitutes an
"item". ... Okay, as I look back and forth some more, I'm developing a
new hypothesis, namely, that "item" refers to the record returned for
one archive holding. (Up to this point I was thinking it meant "one of
the things on our checklist of quality points to check for".) So does
that mean that each harvested record from the archive is given a quality
score from 0 to 10, and the stars are based on the average for all
records? That is starting to sound more likely. In that case, it still
seems like the stars are based only on "4. Metadata quality".

Now I think I'm understanding what the quality metric is, but I want to
make sure. The first part of it is:

   Code exists score = (Number of elements containing code attributes) /
      (Number of elements in the record of a type with an associated code)

Does this mean:

   Code exists score = (Number of elements containing a code attribute) /
      (Number of elements in the record that could have a code attribute)

If so, then we could explore some actual cases and ask if we are getting
a reasonable answer. For instance, if there were a record that contained
only one element and that was a subject element with a code from
OLAC-Language, would that mean a "code exists score" of 1.0? It would be
missing 4 out of 5 core elements for a deduction of 10 * (1/5)(.8) = 1.6,
yielding a total score of 8.4. If the archive contained thousands of
records, all of which had only a coded subject element, then the average
item score would be 8.4, for an overall rating of four stars.

Have I understood the formulas correctly? If so, then I think we will
need to do more work on figuring out the best way to compute the overall
score. In this case a score that multiplies the percentage of core
elements by the "code exists" percentage would yield 2 out of 10, which
sounds like a more appropriate score.

A fine point on "code exists" is what it does with non-OLAC refinements.
For instance, if a record contained only two elements and they were type
elements, one with the OLAC Linguistic Type and the other with the DCMI
Type, would that score as 1.0 or 0.5 on the "code exists" scale? It
looks to me like it would be 0.5, which is half as good as the score of
a record consisting of only one coded element, but in fact the record
with two elements is a better quality record.
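[Gary's worked example above is easy to check mechanically. The sketch
below reproduces his arithmetic under his stated reading of the
formulas; the star conversion (average divided by two, rounded down) is
an assumption that merely matches his 8.4-to-four-stars figure, and the
real algorithm may differ.]

```python
# Reproducing the worked example above. The numbers restate Gary's
# reading of the documented metric; stars = floor(average / 2) is an
# assumption consistent with "8.4 -> four stars".

def stars(avg_score):
    """Assumed mapping from a 0-10 average item score to 0-5 stars."""
    return int(avg_score // 2)

# Record containing only a coded subject element:
code_exists = 1.0        # one codable element, and it carries a code
core_present = 1 / 5     # subject is the only core DC element present

deduction = 10 * (1 / 5) * 0.8   # the deduction exactly as Gary computes it
total = 10 - deduction
print(total, stars(total))       # 8.4 4

# Gary's proposed alternative: multiply the two percentages instead.
alternative = 10 * core_present * code_exists
print(alternative, stars(alternative))   # 2.0 1
```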
The metric for "3. Archive Diversity" needs more thought. It is defined
as:

   Diversity = (Distinct code values / Number of instances of element) * 100

The scores for diversity with respect to the Linguistic Type code
illustrate the problem well. An archive containing only one record which
is coded for one linguistic type would score 100%, whereas an archive
containing 1,000 records, all of which have a type element for the same
code, would score 0.1%--but the one archive is not 1,000 times as
diverse as the other.

I'm wondering if the formula shouldn't be:

   Diversity = (Distinct code values / Total codes in vocabulary) * 100

Then every archive that has at least one instance of all three possible
values of Linguistic Type (regardless of how many records it has) would
be maximally diverse with respect to linguistic type. I think that
sounds more correct.

Rather than Diversity, I wonder if the concepts of Breadth and Depth
would serve better. That is, the Breadth of an archive (with respect to
a controlled vocabulary) would be the percentage of possible values that
it has. Its Depth (with respect to that vocabulary) would be the average
number of records it has for each used code value.
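[The anomaly Gary describes, and his proposed alternatives, can be
compared directly. In the sketch below the vocabulary size of three
follows the "three possible values of Linguistic Type" mentioned above,
and the value name primary_text is assumed from the OLAC Linguistic Type
vocabulary; the function names are illustrative.]

```python
# Comparing the documented Diversity formula with Gary's proposal, plus
# his Breadth/Depth idea, for the Linguistic Type vocabulary.

from collections import Counter

VOCAB_SIZE = 3  # Linguistic Type has three possible values

def diversity_current(codes):
    """Distinct codes / number of element instances, as a percentage."""
    return 100 * len(set(codes)) / len(codes)

def diversity_proposed(codes):
    """Distinct codes / total codes in the vocabulary (Gary's Breadth)."""
    return 100 * len(set(codes)) / VOCAB_SIZE

def depth(codes):
    """Average number of instances per used code value."""
    counts = Counter(codes)
    return sum(counts.values()) / len(counts)

one_record = ["primary_text"]
big_archive = ["primary_text"] * 1000

print(diversity_current(one_record), diversity_current(big_archive))
# 100.0 0.1   -- the anomaly Gary points out
print(diversity_proposed(one_record), diversity_proposed(big_archive))
# 33.3 33.3   -- both archives use 1 of 3 possible values
print(depth(big_archive))  # 1000.0
```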
On "7. Code usage", elements that may take a code are in focus. I think
it should be the code sets (i.e. the extensions) themselves. We
presently define five extensions, but some can occur with more than one
element; I think there are a total of 7 element-extension combinations.
It is those that should be analyzed here, not just the elements. For
instance, subject can occur with Language and Linguistic Field. Those
should be calculated as two separate entries in the chart.

That's all I have for now, but that is plenty to get the discussion
rolling. Are you going to be at the EMELD conference? If so, that might
be a great opportunity for some of us to gather at a whiteboard and
thrash out possible metrics.

Hope to see you there,
-Gary

Baden Hughes
Sent by: OLAC Implementers List
06/27/2004 07:14 PM
Subject: updated OLAC Archive Report Cards
Please respond to Open Language Archives Community Implementers List

Dear OLAC Implementers,

You may recall that in March, we announced a new service which had
recently been added to the OLAC site, namely archive report cards. These
give summary statistics for each repository and an assessment of the
quality of the repository's metadata against both external best practice
recommendations and the relative use practices within the OLAC context.

An updated version of the archive report cards is now available. Changes
include:

- an updated evaluation algorithm to account for changes in DC
  recommendations (e.g. use of contributor rather than creator)
- updated labelling of graphs to be more consistent with OLAC terminology

The report cards can be accessed by clicking the "REPORT CARD" links on
the OLAC Archives page [1]. The report is also available for the full
set of repositories [2]. Information about how these reports are
generated is also available [3]. Reports are updated after every
harvest, currently every 12 hours.

The evaluation metric rewards the use of OLAC extensions (controlled
vocabularies) and of what we consider to be core DC elements: title,
date, subject, description, and identifier.

The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
at the University of Melbourne, with sponsorship from the Department of
Computer Science and Software Engineering. We welcome your feedback.

Regards

Baden Hughes

[1] http://www.language-archives.org/archives.php4
[2] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all
[3] http://www.language-archives.org/tools/reports/ExplainReport.html
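[For concreteness, here is one way to enumerate the element-extension
combinations Gary describes. The pairing of extensions with elements
below is an assumption, chosen to be consistent with his count of seven
and with subject taking both Language and Linguistic Field; the OLAC
extension definitions are authoritative.]

```python
# One possible enumeration of element-extension combinations. The
# pairing below is an assumption consistent with Gary's count of seven;
# consult the OLAC extension definitions for the authoritative list.

EXTENSION_ELEMENTS = {
    "OLAC-Language":         ["language", "subject"],
    "OLAC-Linguistic-Field": ["subject"],
    "OLAC-Linguistic-Type":  ["type"],
    "OLAC-Discourse-Type":   ["type"],
    "OLAC-Role":             ["creator", "contributor"],
}

combinations = [(element, ext)
                for ext, elements in EXTENSION_ELEMENTS.items()
                for element in elements]
for element, ext in combinations:
    print(f"{element} x {ext}")
print(len(combinations))  # 7
```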