[Corpora-List] Corpus Benevolence

Sun Feb 11 07:16:38 UTC 2007

On 2/10/07, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> - how do you describe a corpus?

One minimalist answer to this question is "Use OLAC Metadata", because
it provides uniform descriptors that help with resource discovery.

OLAC, the Open Language Archives Community, is an international
partnership of institutions and individuals who are creating a
worldwide virtual library of language resources by: (i) developing
consensus on best current practice for the digital archiving of
language resources, and (ii) developing a network of interoperating
repositories and services for housing and accessing such resources.
http://www.language-archives.org/

OLAC extends Dublin Core Metadata by providing vocabularies for
describing language resources, including language identification,
linguistic data type, discourse type, and linguistic subject.
http://www.language-archives.org/REC/olac-extensions.html
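
As a rough illustration of how such a record fits together, here is a
minimal Python sketch that builds one for an invented corpus.  It is
not taken from the OLAC documentation.  The OLAC namespace URI, the
"olac:language" and "olac:linguistic-type" type names, the "eng"
language code, and the "primary_text" value are my assumptions about
the schema, so check them against the extensions document above
before relying on them.  The helper names (q, describe_corpus) are
hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs. The OLAC URI is an assumption (it names the 1.1 schema);
# the Dublin Core and XML Schema instance URIs are the standard ones.
NS = {
    "olac": "http://www.language-archives.org/OLAC/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Return the '{uri}local' name ElementTree uses for prefix:local."""
    return "{%s}%s" % (NS[prefix], local)


def describe_corpus(title, language_code, linguistic_type):
    """Build a minimal OLAC-style record (hypothetical helper)."""
    record = ET.Element(q("olac", "olac"))
    # Plain Dublin Core element: the title of the resource.
    ET.SubElement(record, q("dc", "title")).text = title
    # OLAC extension: language identification on dc:subject.
    ET.SubElement(record, q("dc", "subject"), {
        q("xsi", "type"): "olac:language",
        q("olac", "code"): language_code,
    })
    # OLAC extension: linguistic data type on dc:type.
    ET.SubElement(record, q("dc", "type"), {
        q("xsi", "type"): "olac:linguistic-type",
        q("olac", "code"): linguistic_type,
    })
    return record


# Invented example values, for illustration only.
record = describe_corpus("Example Corpus", "eng", "primary_text")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The point of the sketch is that the record is ordinary Dublin Core
with a few controlled-vocabulary attributes added, which is what
makes the descriptors uniform across repositories.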

Many repositories of language resources categorize their holdings
using OLAC Metadata, including LDC, SIL, LINGUIST List, the Rosetta
Project, and TalkBank, among others.
http://www.language-archives.org/archives.php4

Once corpora are categorized in this way they can be searched.  OLAC
has a federated search service that permits all repositories to be
searched simultaneously.  (Part of the inspiration for this was all
the queries for obscure resources that have appeared on this list.)
http://www.language-archives.org/tools/search/
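
To see why uniform descriptors make this kind of search possible,
here is a toy Python sketch.  It is not how the OLAC search service
is implemented.  The repository names, records, and field names are
all invented, and federated_search is a hypothetical helper.  The
only idea it illustrates is that one query can run unchanged over
every repository once they all use the same fields and codes.

```python
# Invented holdings: repository names, titles, and codes are hypothetical.
REPOSITORIES = {
    "archive_a": [
        {"title": "Spoken narratives", "language": "eng", "type": "primary_text"},
        {"title": "Bilingual wordlist", "language": "fra", "type": "lexicon"},
    ],
    "archive_b": [
        {"title": "Newswire sample", "language": "eng", "type": "primary_text"},
    ],
}


def federated_search(repositories, **criteria):
    """Return (repository, title) pairs for records matching every criterion."""
    hits = []
    for name, records in repositories.items():
        for record in records:
            # Because all records share the same keys, one test fits all.
            if all(record.get(key) == value for key, value in criteria.items()):
                hits.append((name, record["title"]))
    return hits


results = federated_search(REPOSITORIES, language="eng")
for repository, title in results:
    print(repository, title)
```

Adding a second criterion, for example type="lexicon", narrows the
search in the same way across all repositories at once.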

A paper that synthesizes all this appeared in Literary and Linguistic
Computing:
Simons, Gary and Steven Bird (2003).  The Open Language Archives
Community: An infrastructure for distributed archiving of language
resources. Literary and Linguistic Computing 18: 117-128.
http://arxiv.org/abs/cs.CL/0306040

-Steven Bird