[Corpora-List] LUCY Corpus available

Geoff Sampson grs2 at cogs.susx.ac.uk
Mon Nov 24 13:03:20 UTC 2003


The initial release of the LUCY Corpus is now freely available for downloading.
The LUCY Corpus is a treebank sampling modern written British English of
three genres:

*  edited published prose

*  the writing of young adults (e.g. A-level exam scripts, 1st-year
   undergraduate essays)

*  spontaneous writing by 9- to 12-year-old children

Compilation of the LUCY Corpus was sponsored by the Economic and Social
Research Council (UK).  The corpus is named after St Lucia or Lucy, patron
saint of writers.

The corpus is structurally annotated in conformity with the SUSANNE annotation
scheme, defined in my _English for the Computer_ (Clarendon, 1995).
Extensions to the scheme were developed in the LUCY project in order to
represent what is going on in cases where unskilled writers fail to produce
written structures that succeed in expressing their apparent intention.

Documentation for the LUCY Corpus, including a definition of the annotation
conventions just mentioned, can be read as a Web page at
www.grsampson.net/LucyDoc.html (13,000 words).  The Corpus itself is
available via www.grsampson.net/Resources.html, as are earlier resources from
my stable.

The initial LUCY release will undoubtedly contain mistakes.  (That is
particularly likely, since pressure from the sponsor for early
publication meant that there was not enough time for all the checks that
would ideally have been applied.)  Users who find errors are warmly urged
to contact me with details, which will be used to produce later, more
accurate releases.  My e-mail address, in a form designed to foil spammers,
is grs2 followed by at-sign followed by sussex.ac.uk



Geoffrey Sampson  MA  PhD  MBCS
Professor of Natural Language Computing

Department of Informatics
University of Sussex
Falmer, Brighton BN1 9QH, England

t  +44 1273 678525
f  +44 1273 671320
w  www.grsampson.net

e-mail address no longer shown to avoid spam flood


