Corpora: Summary: Corpus metadata

Mikko Lounela mlounela at kotus.fi
Mon Jun 24 07:38:16 UTC 2002


Hi there again.

About two weeks ago I posted a query about corpus metadata and
promised to post a summary. Thank you very much for the answers (eight
in total); here is the summary.

      - Mikko

Here is the original query:

>>From mlounela at kotus.fi Mon Jun 24 09:41:28 2002
>Date: Wed, 5 Jun 2002 13:36:14 +0300 (EET DST)
>From: Mikko Lounela <mlounela at kotus.fi>
>To: CORPORA at HD.UIB.NO
>Subject: Corpus metadata
>
>
>Hello everybody.
>
>I am currently trying to figure out what information to include in text
>corpora metadata. At this point, I'm trying to collect references. So, if
>you have any to share, I would be most grateful. Summary will follow.
>
>	- Mikko Lounela


Here is a brief summary:

Paul Clough recommended two books:
Corpus Linguistics (1996), Tony McEnery and Andrew Wilson, Edinburgh
Textbooks in Empirical Linguistics, and
Corpus Annotation (1997), Roger Garside, Geoffrey Leech and Tony McEnery,
Longman.

Mickel Grönroos said that the Language Bank of Finland uses a metadata
set that resembles Dublin Core
(<http://www.dublincore.org/documents/1999/07/02/dces/>).

Lou Burnard pointed to the TEI Guidelines
(<http://www.tei-c.org/Guidelines>, in particular chapters 5 and 23).

Manne Miettinen suggested having a look at IMDI and OLAC
(<http://www.mpi.nl/ISLE/index.html>,
<http://www.language-archives.org/>).

Rita Simpson recommended an article by Simpson & Powell in the book
edited by Rita Simpson & John Swales, Corpus Linguistics in North
America: Selections from the 1999 Symposium, 2001, Univ. of Michigan
Press, and another article by Simpson, Lucka & Ovens in the proceedings
volume of TALC 1998, edited by Burnard & McEnery.

Sven Hartrumpf suggested the Corpus Encoding Standard
(<http://www.cs.vassar.edu/CES/>
esp. <http://www.cs.vassar.edu/CES/CES1-3.html>).

Martin Wynne gave a few pointers: the TEI Guidelines, the BNC
User Reference Guide, section 8
(<http://www.hcu.ox.ac.uk/BNC/World/HTML/cdifhd.html>), and OLAC. He also
mentioned a seminar to be held at the Oxford Text Archive
(<http://www.oucs.ox.ac.uk/ltg/courses/summer/documents/corpora.htm>).

Truus Kruyt recommended Kruyt & Dutilh 1997, available at <www.inl.nl>
under Publications.

Here are all the answers (some were originally written in Finnish):

**************************************
>From p.clough at dcs.shef.ac.uk Mon Jun 24 09:42:46 2002
Date: Wed, 5 Jun 2002 12:02:05 +0100
From: Paul Clough <p.clough at dcs.shef.ac.uk>
To: Mikko Lounela <mlounela at kotus.fi>
Subject: Re: Corpora: Corpus metadata

Mikko,

Two references for you:

Corpus Linguistics (1996), Tony McEnery and Andrew Wilson, Edinburgh
textbooks in empirical linguistics.

Corpus Annotation (1997), Roger Garside, Geoffrey Leech and Tony McEnery,
Longman.

These both mention meta-linguistic information.

Best,

Paul.

----------------------------------------------------------------------------
Paul Clough

Natural Language Processing Group,
Department of Computer Science,
University of Sheffield,
G35 Regent Court,
211 Portobello Street,
SHEFFIELD,
S1 4DP.

**************************************
>From mickel at csc.fi Mon Jun 24 09:42:57 2002
Date: Wed, 5 Jun 2002 14:00:11 +0300 (EEST)
From: Mickel Grönroos <mickel at csc.fi>
To: Mikko Lounela <mlounela at kotus.fi>
Subject: Re: Corpora: Corpus metadata

Mikko,

Good that you asked. There was a lot of talk about this at the LREC
conference last week. I hope you get good answers. The Lemmie software
accepts Dublin Core-like metadata at the document level. Each corpus
document can carry the following information:

contributor	(e.g. the tagger)
creator		(e.g. the author of the article)
date		(e.g. the publication date)
id		(the unique id of the document)
language	(the language, e.g. fi_FI)
publisher	(e.g. Helsinki Media)
source		(e.g. Helsingin Sanomat)
subject		(keywords, freely definable)
title		(e.g. "Ahtisaari söi makkaraa", i.e. "Ahtisaari ate sausage")
type		(this is still open for us; it should be some defined
		subject-field value, but we have not found a good
		definition anywhere. In my opinion this is actually the
		biggest problem. It could be something like:
		"written::fact::newspaper::daily::sports::golf")

In other words, a controlled vocabulary of permitted values should be
defined for the type field.

Here is the link:

	http://www.dublincore.org/documents/1999/07/02/dces/

(If one wants to attach metadata to levels other than documents (e.g.
paragraphs, sentences, etc.), some other metadata set will probably
have to be developed. I believe Dublin Core works just as well for
encoding documents as for encoding whole collections.)

/Mickel
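
As a rough illustration of the document-level field set and the
controlled-vocabulary idea for "type" described in the message above, here
is a minimal Python sketch. Only the field names and the example type path
come from the message; the class name, the function name, and the
vocabulary tree are hypothetical and are not part of Lemmie or Dublin Core.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical controlled vocabulary, stored as a nested tree.
# It contains only the single example path quoted in the message;
# a real vocabulary would have many branches at each level.
TYPE_TREE = {
    "written": {"fact": {"newspaper": {"daily": {"sports": {"golf": {}}}}}}
}


@dataclass
class DocumentMetadata:
    """Dublin Core-like metadata for one corpus document (illustrative)."""
    id: str
    title: str = ""
    creator: str = ""
    contributor: str = ""
    date: str = ""
    language: str = ""
    publisher: str = ""
    source: str = ""
    subject: List[str] = field(default_factory=list)
    type: str = ""


def validate_type(value, tree=TYPE_TREE):
    """Return True if every '::'-separated part follows a branch of the tree.

    A prefix of a known path (e.g. "written::fact") is accepted as well.
    """
    node = tree
    for part in value.split("::"):
        if part not in node:
            return False
        node = node[part]
    return True


doc = DocumentMetadata(
    id="doc-0001",  # made-up identifier
    title="Ahtisaari söi makkaraa",
    language="fi_FI",
    publisher="Helsinki Media",
    source="Helsingin Sanomat",
    type="written::fact::newspaper::daily::sports::golf",
)
print(validate_type(doc.type))  # True
```

Storing the vocabulary as a tree keeps each level dependent on its parent,
so a value such as "written::golf" is rejected even though both parts occur
somewhere in the vocabulary.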

**************************************
>From lou.burnard at computing-services.oxford.ac.uk Mon Jun 24 09:43:16 2002
Date: Wed, 5 Jun 2002 12:03:33 +0100
From: Lou Burnard <lou.burnard at computing-services.oxford.ac.uk>
To: Mikko Lounela <mlounela at kotus.fi>
Subject: Re: Corpora: Corpus metadata

Extensively covered in the TEI Guidelines. See
http://www.tei-c.org/Guidelines in particular chapters 5 and 23

**************************************
>From manne.miettinen at csc.fi Mon Jun 24 09:43:31 2002
Date: Wed, 5 Jun 2002 14:13:30 +0300 (EEST)
From: Manne Miettinen <manne.miettinen at csc.fi>
Subject: Re: Corpora: Corpus metadata

Hi,

Metadata for language resources is a topical undertaking. At the
moment there are at least two competing proposals:
the European IMDI and the American OLAC.

I am tentatively inclined to support the former,
because it is broader and can be mapped to the latter. OLAC
uses a modified Dublin Core, but in the view of IMDI's
developers (Max Planck Institute for Psycholinguistics)
it is too loose and ambiguous for describing language
resources.

See for yourselves at

http://www.mpi.nl/ISLE/index.html
http://www.language-archives.org/
--

CSC - Scientific Computing Ltd	 Manne MIETTINEN
PO BOX 405 (Tekniikantie 15 a D) manne.miettinen at csc.fi
FIN-02101 Espoo			 tel. +358 9 457 2517
FINLAND				 gsm. +358 050 381 9510
**************************************
>From ritacsim at umich.edu Mon Jun 24 09:43:41 2002
Date: Wed, 5 Jun 2002 08:03:24 -0400 (EDT)
From: Rita Carol Simpson <ritacsim at umich.edu>
To: Mikko Lounela <mlounela at kotus.fi>
Subject: Re: Corpora: Corpus metadata

Hello,
by text corpus do you mean only written texts? If not, see my article
(Simpson & Powell) in the book edited by me & John Swales (Corpus
Linguistics in North America: Selections from the 1999 Symposium, 2001,
Univ. of Michigan Press). Also another article (by Simpson, Lucka & Ovens)
in the proceedings volume of TALC 1998, edited by Burnard & McEnery.

Rita Simpson

_________________________________________________________________________

Rita Simpson, PhD.
Project Manager, Michigan Corpus of Academic Spoken English
English Language Institute, University of Michigan   TEL:  734-763-7133
www.lsa.umich.edu/eli/micase/micase.htm      www.hti.umich.edu/m/micase/
_________________________________________________________________________

**************************************
>From Sven.Hartrumpf at FernUni-Hagen.de Mon Jun 24 09:43:55 2002
Date: Wed, 05 Jun 2002 18:07:47 +0200 (CEST)
From: Sven Hartrumpf <Sven.Hartrumpf at FernUni-Hagen.de>
To: mlounela at kotus.fi
Subject: Re: Corpora: Corpus metadata

Dear Mikko Lounela.
The Corpus Encoding Standard is a successful standard for corpora and
as such includes a sensible set of metadata attributes.
Please read
http://www.cs.vassar.edu/CES/
esp. http://www.cs.vassar.edu/CES/CES1-3.html
Greetings
Sven
**************************************

>From martin.wynne at ota.ahds.ac.uk Mon Jun 24 09:44:31 2002
Date: Thu, 6 Jun 2002 15:39:08 +0100
From: Martin Wynne <martin.wynne at ota.ahds.ac.uk>
To: 'Mikko Lounela' <mlounela at kotus.fi>
Subject: RE: Corpora: Corpus metadata

I have a few pointers which may help. I have attached a copy of one of the
headers we use for the electronic texts and text corpora in the Oxford Text
Archive. These follow the Text Encoding Initiative guidelines
(http://www.hcu.ox.ac.uk/TEI/P4X/HD.html).

I also refer you to the British National Corpus User Reference Guide, which
has a detailed description of the metadata there in section 8
(http://www.hcu.ox.ac.uk/BNC/World/HTML/cdifhd.html).

You should also take a look at the Open Language Archives Community metadata
set. This is a new initiative promoting a standardised way of describing
basic information about language resources so that the information from
different archives and other data providers can be shared. You can find out
more at
http://www.language-archives.org/.

You may also be interested in a one-day seminar being held here at the
Oxford Text Archive which will take a look at this issue in some detail. You
can find more information on this event at
http://www.oucs.ox.ac.uk/ltg/courses/summer/documents/corpora.htm.

Best wishes,
Martin

__
Martin Wynne
martin.wynne at ota.ahds.ac.uk
Linguistics Officer
Oxford Text Archive

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275


<TEIHEADER>
<FILEDESC>
<TITLESTMT><TITLE TYPE="main">The Lampeter Corpus of Early Modern English
Tracts
</TITLE>
<AUTHOR>Corpora, Corpus</AUTHOR>
<EDITOR>Josef Schmied, Claudia Claridge, Rainer Siemund</EDITOR>
<FUNDER>Deutsche Forschungsgemeinschaft (DFG)</FUNDER>
</TITLESTMT>
<EDITIONSTMT><P>First TEI conformant edition</P>
</EDITIONSTMT>
<EXTENT><SEG TYPE="designation">Text data</SEG>
<SEG TYPE="size">c. 1,1 million words; Filesize uncompressed: 7.8Mb</SEG>
<SEG TYPE="format">SGML TEI Lite</SEG>
<SEG TYPE="location">online</SEG>
</EXTENT>
<PUBLICATIONSTMT>
<PUBLISHER>Josef Schmied, Claudia Claridge, Rainer Siemund</PUBLISHER>
<DISTRIBUTOR>
<NAME KEY="ota" TYPE="organisation">Oxford Text Archive</NAME>
<NAME TYPE="place">Oxford</NAME>
<ADDRESS>
<ADDRLINE><NAME KEY="oucs" TYPE="organisation">Oxford University Computing
Services</NAME></ADDRLINE>
<ADDRLINE>13 Banbury Road</ADDRLINE>
<ADDRLINE>Oxford</ADDRLINE>
<ADDRLINE>OX2 6NN</ADDRLINE>
<ADDRLINE><NAME TYPE="email">info at ota.ahds.ac.uk</NAME>
</ADDRLINE>
</ADDRESS>
</DISTRIBUTOR>
<PUBPLACE>Chemnitz</PUBPLACE>
<DATE>1998</DATE>
<IDNO ID="OTA">2400</IDNO>
<AVAILABILITY STATUS="free"><P>Original texts by permission of the Founders'
Library, University of Wales, Lampeter. Copyright of electronic version: REAL
Centre, Chemnitz University of Technology. The corpus is freely available for
scholarly use in private research and also for teaching purposes.</P>
</AVAILABILITY></PUBLICATIONSTMT>
<SOURCEDESC><P>17th- and 18th-century collections of the Founders' Library,
University of Wales, Lampeter (formerly Saint David's College), especially the
Tract Collection, cf. Saint David's University College. 1975. A catalogue of the
tract collection of Saint David's University College, Lampeter. London: Mansell
Information Publishing.</P>
</SOURCEDESC><!-- For more documentation on the compilation, structure and
encoding practices of the corpus see the websites of the Oxford Text Archive and
of ICAME.-->




**************************************

>From kruyt at inl.nl Mon Jun 24 09:44:45 2002
Date: Fri, 7 Jun 2002 14:45:01 +0200
From: Truus Kruyt <kruyt at inl.nl>
To: Mikko Lounela <mlounela at kotus.fi>
Subject: Re: Corpora: Corpus metadata

Have a look at Kruyt & Dutilh 1997 at www.inl.nl sub Publications.
Best, Truus Kruyt
**************************************


