From thien at UNIMELB.EDU.AU Mon Jun 21 01:30:17 2004
From: thien at UNIMELB.EDU.AU (Nicholas Thieberger)
Date: Mon, 21 Jun 2004 11:30:17 +1000
Subject: OLAC Metadata
Message-ID:

Query from Linda Barwick following a discussion on the 'report card'
for our metadata set.

>1. creator - why do they score on this under 'Code Usage' when DC
>recommendation is to use contributor?
>"Dublin Core now discourages the use of the Creator element,
>recommending that all Role information be associated with
>Contributor elements." http://www.language-archives.org/REC/role.html

From badenh at CS.MU.OZ.AU Mon Jun 21 02:51:23 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Mon, 21 Jun 2004 12:51:23 +1000
Subject: OLAC Metadata
In-Reply-To:
Message-ID:

It's nice to hear that someone's looking at the reports :-)

The Dublin Core recommendation is now that Role attributes be associated
with Contributor; however, the development of the OLAC controlled
vocabulary pre-dated the DC recommendation. In fact, the OLAC Role
Vocabulary describes attributes which may be referential to both Creator
and Contributor, and these elements exist independently within OLAC
records.

For the reference of others on this list, I'm referring here to:

http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=79

As to why creator is scored under 'Code Usage': in PARADISEC's case,
where only Contributor is present, the Creator element is in fact
redundant (hence the 0% score). In other words, in the context of this
particular archive, the OLAC Role vocabulary is only applicable to
Contributor. In other archives, however, both Creator and Contributor
are used, both with and without the OLAC Role Vocabulary. While it could
be argued that these latter archives are not following "best practice"
as recommended by Dublin Core, it's interesting to note that overall:

http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all

Creator is used more times than Contributor, but the OLAC Role vocabulary
is much more widely used for Contributor instances.

This type of observation is precisely why the Archive Report Card was
developed: to allow archive maintainers to see where best to expend
effort in metadata quality improvement.

Regards

Baden

On Mon, 21 Jun 2004, Nicholas Thieberger wrote:

> Query from Linda Barwick following a discussion on the 'report card'
> for our metadata set.
>
> >1. creator - why do they score on this under 'Code Usage' when DC
> >recommendation is to use contributor?
> >"Dublin Core now discourages the use of the Creator element,
> >recommending that all Role information be associated with
> >Contributor elements." http://www.language-archives.org/REC/role.html
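[To make the 'Code Usage' figure discussed above concrete, here is a
minimal sketch of a per-element tally. The record structure and function
name are illustrative assumptions, not the actual report-card code.]

```python
# Sketch of a per-element 'Code Usage' tally. Records are assumed to be
# parsed into (element, carries_olac_code) pairs; this structure is an
# illustration, not the report card's real data model.

def role_code_usage(records, element):
    """Percentage of `element` instances carrying an OLAC Role code."""
    instances = [has_code
                 for rec in records
                 for (name, has_code) in rec
                 if name == element]
    if not instances:
        return 0.0  # element never appears, as with creator at PARADISEC
    return 100.0 * sum(instances) / len(instances)

# A PARADISEC-like archive: roles are coded on contributor only, and
# creator never appears, so creator scores 0%.
records = [
    [("contributor", True), ("title", False)],
    [("contributor", True), ("contributor", False)],
]
print(role_code_usage(records, "contributor"))  # 66.66... (2 of 3 coded)
print(role_code_usage(records, "creator"))      # 0.0
```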
From badenh at CS.MU.OZ.AU Mon Jun 28 00:14:15 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Mon, 28 Jun 2004 10:14:15 +1000
Subject: updated OLAC Archive Report Cards
Message-ID:

Dear OLAC Implementers,

You may recall that in March, we announced a new service which had
recently been added to the OLAC site, namely archive report cards. These
give summary statistics for each repository and an assessment of the
quality of the repository's metadata against both external best practice
recommendations and the relative use practices within the OLAC context.

An updated version of the archive report cards is now available. Changes
include:

- an updated evaluation algorithm to account for changes in DC
  recommendations (e.g. use of contributor rather than creator)
- updated labelling of graphs to be more consistent with OLAC terminology

The report cards can be accessed by clicking the "REPORT CARD" links on
the OLAC Archives page [1]. The report is also available for the full
set of repositories [2]. Information about how these reports are
generated is also available [3]. Reports are updated after every
harvest, currently every 12 hours.

The evaluation metric rewards the use of OLAC extensions (controlled
vocabularies) and of what we consider to be core DC elements: title,
date, subject, description, and identifier.

The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
at the University of Melbourne, with sponsorship from the Department of
Computer Science and Software Engineering. We welcome your feedback.

Regards

Baden Hughes

[1] http://www.language-archives.org/archives.php4
[2] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all
[3] http://www.language-archives.org/tools/reports/ExplainReport.html
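[As an illustration of the core-element side of the metric described
above, a minimal sketch. The equal weighting of the five core elements
is an assumption; the documentation at [3] is authoritative.]

```python
# Sketch of the core-DC-element part of the evaluation metric described
# above. Equal weighting of the five core elements is an assumption.

CORE_ELEMENTS = {"title", "date", "subject", "description", "identifier"}

def core_element_score(record_elements):
    """Fraction of the five core DC elements present in a record."""
    present = CORE_ELEMENTS & set(record_elements)
    return len(present) / len(CORE_ELEMENTS)

# A record with only a title and an identifier covers 2 of 5 core elements.
print(core_element_score(["title", "identifier", "format"]))  # 0.4
```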
From badenh at CS.MU.OZ.AU Mon Jun 28 00:27:28 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Mon, 28 Jun 2004 10:27:28 +1000
Subject: new OLAC-oriented search engine
Message-ID:

Dear OLAC Implementers,

A new service has recently been added to the LDC site, namely an OLAC
Search Engine [1].

This search engine complements other OLAC search engines, including
those already deployed at LinguistList [2]. This instantiation takes an
entirely different approach, and aims to be very similar to
keyword-based web search engines.

Other features of the search engine include:

- search by alternate names using the Ethnologue
- search by language code
- keyword-in-context highlighting in search results
- search for similar spellings
- exact/approximate/partial string matching
- search operators (AND, OR, NOT, +, -)
- support for inline syntax (e.g. creator:hale)
- search for related items on Google

The search engine results are ranked according to the quality of the
metadata on a per-record and per-archive basis. These rankings are
derived from the same underlying algorithm used in the OLAC Archive
Reports [3].

The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
at the University of Melbourne, with sponsorship from the Department of
Computer Science and Software Engineering. We welcome your feedback.

Regards

Baden Hughes

[1] http://wave.ldc.upenn.edu/olac/search.php
[2] http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/olac/olac-search-advanced.cfm
[3] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all
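[As a rough illustration of the inline field syntax mentioned above
(e.g. creator:hale), here is a sketch of how such queries might be split
into fielded and free-text terms. The grammar and field list are assumed
from the announcement, not taken from the actual LDC implementation.]

```python
import re

# Sketch of parsing inline field syntax such as "creator:hale".
# The token grammar (field:value mixed with free-text keywords) and the
# set of recognised fields are assumptions for illustration only.

FIELDS = {"creator", "contributor", "title", "subject", "language"}

def parse_query(query):
    """Split a query string into fielded terms and free-text keywords."""
    fielded, keywords = {}, []
    for token in query.split():
        m = re.match(r"(\w+):(.+)", token)
        if m and m.group(1) in FIELDS:
            fielded.setdefault(m.group(1), []).append(m.group(2))
        else:
            keywords.append(token)
    return fielded, keywords

print(parse_query("creator:hale warlpiri texts"))
# ({'creator': ['hale']}, ['warlpiri', 'texts'])
```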
From hdry at LINGUISTLIST.ORG Mon Jun 28 13:08:49 2004
From: hdry at LINGUISTLIST.ORG (Helen Aristar-Dry)
Date: Mon, 28 Jun 2004 09:08:49 -0400
Subject: new OLAC-oriented search engine
In-Reply-To:
Message-ID:

Hi, Baden,

The LDC link doesn't seem to be working.

-Helen

Quoting Baden Hughes:

> Dear OLAC Implementers,
>
> A new service has recently been added to the LDC site, namely an OLAC
> Search Engine [1].
>
> This search engine complements other OLAC search engines, including
> those already deployed at LinguistList [2]. This instantiation takes
> an entirely different approach, and aims to be very similar to
> keyword-based web search engines.
>
> Other features of the search engine include:
>
> - search by alternate names using the Ethnologue
> - search by language code
> - keyword-in-context highlighting in search results
> - search for similar spellings
> - exact/approximate/partial string matching
> - search operators (AND, OR, NOT, +, -)
> - support for inline syntax (e.g. creator:hale)
> - search for related items on Google
>
> The search engine results are ranked according to the quality of the
> metadata on a per-record and per-archive basis. These rankings are
> derived from the same underlying algorithm used in the OLAC Archive
> Reports [3].
>
> The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
> at the University of Melbourne, with sponsorship from the Department
> of Computer Science and Software Engineering. We welcome your feedback.
>
> Regards
>
> Baden Hughes
>
> [1] http://wave.ldc.upenn.edu/olac/search.php
> [2] http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/olac/olac-search-advanced.cfm
> [3] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all

-Helen

Helen Aristar-Dry
Prof. of Linguistics
Eastern Michigan University
Ypsilanti, MI 48103
hdry at linguistlist.org
734.487.0144 (office)
734.741.1567 (home)

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From haejoong at LDC.UPENN.EDU Mon Jun 28 14:26:55 2004
From: haejoong at LDC.UPENN.EDU (Haejoong Lee)
Date: Mon, 28 Jun 2004 10:26:55 -0400
Subject: new OLAC-oriented search engine
In-Reply-To: <1088428129.40e018614fd68@webmail.linguistlist.org>
Message-ID:

On Mon, Jun 28, 2004 at 09:08:49AM -0400, Helen Aristar-Dry wrote:
> Hi, Baden,
> The ldc link doesn't seem to be working.

Hi Helen,

LDC is experiencing intermittent network connectivity with locations
outside of the Penn network. I hope this can be resolved soon.

-Haejoong

From badenh at CS.MU.OZ.AU Tue Jun 29 11:47:46 2004
From: badenh at CS.MU.OZ.AU (Baden Hughes)
Date: Tue, 29 Jun 2004 21:47:46 +1000
Subject: correction: new OLAC-oriented search engine URL
Message-ID:

The URL for the OLAC-oriented search engine was incorrectly specified
earlier. The correct URL is:

http://www.ldc.upenn.edu/olac/search.php

Apologies for any inconvenience.

Regards

Baden

From Gary_Simons at SIL.ORG Wed Jun 30 02:25:08 2004
From: Gary_Simons at SIL.ORG (Gary Simons)
Date: Tue, 29 Jun 2004 21:25:08 -0500
Subject: updated OLAC Archive Report Cards
In-Reply-To:
Message-ID:

Baden,

Thanks for all your work on this report card system. I think it is a
great idea and trust that it will serve to help all of us improve the
quality of the metadata we are publishing. I'm the implementer for two
archives, one of which ends up with a five-star rating, while the other
ends up with just two. My intuition is that they are not that different
in quality, so I'm trying to understand the scoring system to see what
accounts for the huge difference. I've also reviewed our archive report
card with Joan Spanne, our archivist, to see what feedback she might
have. She is actually responsible for many of the observations in this
note.

I'm looking at the documentation page and find that the explanation in
"2. Star Rating" doesn't give enough information to make it clear. It
says the rating is based on the "average item score out of 10". I'm not
sure what that means. A natural conclusion would be that each of the
remaining outline points in the document is an item, and the overall
rating is based on the average of those. But I don't think that is what
it means, since those don't describe things scored on a scale of 0 to
10. Ten-point scoring seems to appear only in "4. Metadata Quality", and
that section does talk about "items", so is it the case that the star
rating deals only with point 4 on metadata quality? If so, the
discussion in point 2 should make this explicit.

If I'm on the right track that the items that are averaged are just
those in point 4, I'm still not completely clear on what constitutes an
"item". ... Okay, as I look back and forth some more, I'm developing a
new hypothesis, namely, that "item" refers to the record returned for
one archive holding. (Up to this point I was thinking it meant "one of
the things on our checklist of quality points to check for".) So does
that mean that each harvested record from the archive is given a quality
score from 0 to 10, and the stars are based on the average for all
records? That is starting to sound more likely. In that case, it still
seems like the stars are based only on "4. Metadata quality".

Now I think I'm understanding what the quality metric is, but I want to
make sure. The first part of it is:

   Code exists score = (Number of elements containing code attributes) /
      (Number of elements in the record of a type with an associated code)

Does this mean:

   Code exists score = (Number of elements containing a code attribute) /
      (Number of elements in the record that could have a code attribute)

If so, then we could explore some actual cases and ask if we are getting
a reasonable answer. For instance, if there were a record that contained
only one element and that was a subject element with a code from
OLAC-Language, would that mean a "code exists score" of 1.0? It would be
missing 4 out of 5 core elements for a deduction of 10 * (1/5)(.8) = 1.6,
yielding a total score of 8.4. If the archive contained thousands of
records, all of which had only a coded subject element, then the average
item score would be 8.4, for an overall rating of four stars.

Have I understood the formulas correctly? If so, then I think we will
need to do more work on figuring out the best way to compute the overall
score. In this case a score that multiplies the percentage of core
elements by the "code exists" percentage would yield 2 out of 10, which
sounds like a more appropriate score.

A fine point on "code exists" is what it does with non-OLAC refinements.
For instance, if a record contained only two elements and they were type
elements, one with the OLAC Linguistic Type and the other with the DCMI
Type, would that score as 1.0 or 0.5 on the "code exists" scale? It
looks to me like it would be 0.5, which is half as good as the score of
a record consisting of only one coded element, but in fact the record
with two elements is a better quality record.
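[Gary's worked example above is easy to check mechanically. The sketch
below reproduces his arithmetic under his stated reading of the
formulas; the star conversion (average divided by two, rounded down) is
an assumption that merely matches his 8.4-to-four-stars figure, and the
real algorithm may differ.]

```python
# Reproducing the worked example above. The numbers restate Gary's
# reading of the documented metric; stars = floor(average / 2) is an
# assumption consistent with "8.4 -> four stars".

def stars(avg_score):
    """Assumed mapping from a 0-10 average item score to 0-5 stars."""
    return int(avg_score // 2)

# Record containing only a coded subject element:
code_exists = 1.0        # one codable element, and it carries a code
core_present = 1 / 5     # subject is the only core DC element present

deduction = 10 * (1 / 5) * 0.8   # the deduction exactly as Gary computes it
total = 10 - deduction
print(total, stars(total))       # 8.4 4

# Gary's proposed alternative: multiply the two percentages instead.
alternative = 10 * core_present * code_exists
print(alternative, stars(alternative))   # 2.0 1
```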
The metric for "3. Archive Diversity" needs more thought. It is defined
as:

   Diversity = (Distinct code values / Number of instances of element) * 100

The scores for diversity with respect to the Linguistic Type code
illustrate the problem well. An archive containing only one record which
is coded for one linguistic type would score 100%, whereas an archive
containing 1,000 records, all of which have a type element for the same
code, would score 0.1%--but the one archive is not 1,000 times as
diverse as the other.

I'm wondering if the formula shouldn't be:

   Diversity = (Distinct code values / Total codes in vocabulary) * 100

Then every archive that has at least one instance of all three possible
values of Linguistic Type (regardless of how many records it has) would
be maximally diverse with respect to linguistic type. I think that
sounds more correct.

Rather than Diversity, I wonder if the concepts of Breadth and Depth
would serve better. That is, the Breadth of an archive (with respect to
a controlled vocabulary) would be the percentage of possible values that
it has. Its Depth (with respect to that vocabulary) would be the average
number of records it has for each used code value.
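[The anomaly Gary describes, and his proposed alternatives, can be
compared directly. In the sketch below the vocabulary size of three
follows the "three possible values of Linguistic Type" mentioned above,
and the value name primary_text is assumed from the OLAC Linguistic Type
vocabulary; the function names are illustrative.]

```python
# Comparing the documented Diversity formula with Gary's proposal, plus
# his Breadth/Depth idea, for the Linguistic Type vocabulary.

from collections import Counter

VOCAB_SIZE = 3  # Linguistic Type has three possible values

def diversity_current(codes):
    """Distinct codes / number of element instances, as a percentage."""
    return 100 * len(set(codes)) / len(codes)

def diversity_proposed(codes):
    """Distinct codes / total codes in the vocabulary (Gary's Breadth)."""
    return 100 * len(set(codes)) / VOCAB_SIZE

def depth(codes):
    """Average number of instances per used code value."""
    counts = Counter(codes)
    return sum(counts.values()) / len(counts)

one_record = ["primary_text"]
big_archive = ["primary_text"] * 1000

print(diversity_current(one_record), diversity_current(big_archive))
# 100.0 0.1   -- the anomaly Gary points out
print(diversity_proposed(one_record), diversity_proposed(big_archive))
# 33.3 33.3   -- both archives use 1 of 3 possible values
print(depth(big_archive))  # 1000.0
```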
On "7. Code usage", elements that may take a code are in focus. I think
it should be the code sets (i.e. the extensions) themselves. We
presently define five extensions, but some can occur with more than one
element; I think there are a total of 7 element-extension combinations.
It is those that should be analyzed here, not just the elements. For
instance, subject can occur with Language and Linguistic Field. Those
should be calculated as two separate entries in the chart.

That's all I have for now, but that is plenty to get the discussion
rolling. Are you going to be at the EMELD conference? If so, that might
be a great opportunity for some of us to gather at a whiteboard and
thrash out possible metrics.

Hope to see you there,
-Gary

Baden Hughes
Sent by: OLAC Implementers List
06/27/2004 07:14 PM
Subject: updated OLAC Archive Report Cards
Please respond to Open Language Archives Community Implementers List

Dear OLAC Implementers,

You may recall that in March, we announced a new service which had
recently been added to the OLAC site, namely archive report cards. These
give summary statistics for each repository and an assessment of the
quality of the repository's metadata against both external best practice
recommendations and the relative use practices within the OLAC context.

An updated version of the archive report cards is now available. Changes
include:

- an updated evaluation algorithm to account for changes in DC
  recommendations (e.g. use of contributor rather than creator)
- updated labelling of graphs to be more consistent with OLAC terminology

The report cards can be accessed by clicking the "REPORT CARD" links on
the OLAC Archives page [1]. The report is also available for the full
set of repositories [2]. Information about how these reports are
generated is also available [3]. Reports are updated after every
harvest, currently every 12 hours.

The evaluation metric rewards the use of OLAC extensions (controlled
vocabularies) and of what we consider to be core DC elements: title,
date, subject, description, and identifier.

The service was developed by Amol Kamat, Baden Hughes, and Steven Bird
at the University of Melbourne, with sponsorship from the Department of
Computer Science and Software Engineering. We welcome your feedback.

Regards

Baden Hughes

[1] http://www.language-archives.org/archives.php4
[2] http://www.language-archives.org/tools/reports/archiveReportCard.php?archive=all
[3] http://www.language-archives.org/tools/reports/ExplainReport.html
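[For concreteness, here is one way to enumerate the element-extension
combinations Gary describes. The pairing of extensions with elements
below is an assumption, chosen to be consistent with his count of seven
and with subject taking both Language and Linguistic Field; the OLAC
extension definitions are authoritative.]

```python
# One possible enumeration of element-extension combinations. The
# pairing below is an assumption consistent with Gary's count of seven;
# consult the OLAC extension definitions for the authoritative list.

EXTENSION_ELEMENTS = {
    "OLAC-Language":         ["language", "subject"],
    "OLAC-Linguistic-Field": ["subject"],
    "OLAC-Linguistic-Type":  ["type"],
    "OLAC-Discourse-Type":   ["type"],
    "OLAC-Role":             ["creator", "contributor"],
}

combinations = [(element, ext)
                for ext, elements in EXTENSION_ELEMENTS.items()
                for element in elements]
for element, ext in combinations:
    print(f"{element} x {ext}")
print(len(combinations))  # 7
```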