[Corpora-List] Developing Linguistic Corpora: a guide to good practice

Mon Oct 10 14:33:22 UTC 2005

The Arts and Humanities Data Service (AHDS) have published 'Developing 
Linguistic Corpora', edited by Martin Wynne of the Oxford Text Archive. 
This is the latest in the series of AHDS Guides to Good Practice.

The printed book can be ordered online from Oxbow Books 
(http://www.oxbowbooks.com/) for £15 plus post and packing, and the full 
text is available for free online at http://ahds.ac.uk/linguistic-corpora/.

In this volume, leading experts offer advice to help readers ensure 
that their corpus is well designed and fit for its intended purpose.

As John Sinclair writes in the first chapter: "A corpus is a remarkable 
thing, not so much because it is a collection of language text, but 
because of the properties that it acquires if it is well-designed and 
carefully-constructed."

The collection includes the following chapters:

* 'Corpus and text: basic principles' by John Sinclair
* 'Adding linguistic annotation' by Geoffrey Leech
* 'Metadata for corpus work' by Lou Burnard
* 'Character encoding in corpus construction' by Tony McEnery and 
Richard Xiao
* 'Spoken language corpora' by Paul Thompson
* 'Archiving, distribution and preservation' by Martin Wynne

John Sinclair sets out ten principles for corpus design, plus a new 
definition of a corpus. Geoffrey Leech offers a taxonomy of types of 
annotations as well as clear guidelines and some provisional standards 
for annotation at various linguistic levels. Lou Burnard explains the 
different types of metadata which can be provided for a corpus, and 
gives examples of how these can be implemented using the Text Encoding 
Initiative guidelines. Tony McEnery and Richard Xiao take on the tricky 
issue of encoding characters in languages other than English, giving a 
historical overview of the various solutions, leading to a discussion of 
how to use Unicode today in encoding corpus texts. Paul Thompson draws 
on his experience in developing the British Academic Spoken English 
(BASE) corpus to set out the stages involved in the development and 
exploitation of a corpus of speech, covering data collection, 
transcription, markup and annotation, and access. In chapter six, Martin 
Wynne explains how good planning and design can help to ensure the 
ongoing availability and usefulness of a corpus.

This and other guides in the series are available from 
http://www.ahds.ac.uk/creating/guides/.

AHDS Literature, Languages and Linguistics is hosted by the Oxford Text 
Archive, and is the repository for many freely available corpora in 
several languages, including English, French, German, Italian, Chinese 
and a variety of South Asian languages. There are also historical 
corpora, such as the Old English Corpus, the Helsinki Corpus of English 
Texts and the Lampeter Corpus of Early Modern English Tracts. These 
resources can be found via the experimental new AHDS cross-subject 
catalogue at http://www.ahds.ac.uk/, and at the OTA website at 
http://www.ota.ox.ac.uk. A listing of corpora is at 
http://www.ota.ox.ac.uk/search/search.perl?misc=corpus. Note that some 
of these resources are available for immediate download and others 
require the user to write in for permission to download them.

Regards,
Martin

-- 
Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk