19.3604, FYI: KRYS I Corpus for Genre Classification Research
LINGUIST Network
linguist at LINGUISTLIST.ORG
Mon Nov 24 19:34:30 UTC 2008
LINGUIST List: Vol-19-3604. Mon Nov 24 2008. ISSN: 1068 - 4875.
Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews: Randall Eggert, U of Utah
<reviews at linguistlist.org>
Homepage: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Matthew Lahrman <matt at linguistlist.org>
===========================Directory==============================
1)
Date: 20-Nov-2008
From: Yunhyong Kim < y.kim at hatii.arts.gla.ac.uk >
Subject: KRYS I Corpus for Genre Classification Research
-------------------------Message 1 ----------------------------------
Date: Mon, 24 Nov 2008 14:33:05
From: Yunhyong Kim [y.kim at hatii.arts.gla.ac.uk]
Subject: KRYS I Corpus for Genre Classification Research
The Humanities Advanced Technology and Information Institute (HATII) at
the University of Glasgow and the Digital Curation Centre (DCC) are
delighted to announce the release of the KRYS I Corpus for genre
classification research.
http://www.krys-corpus.eu
The corpus, consisting of 6434 documents labelled with document genres,
is expected to become a major research resource for researchers in text
processing and in data and information management. In particular, we
encourage the use of the corpus for research in:
- Automated Text Classification (TC)
- Digital curation and metadata extraction
- Natural Language Processing (NLP)
- Computational Linguistics (CL)
Document genre classification has potential as a supporting step in
language processing, document management, and information retrieval (e.g.
the linguistic style and the vocabulary of a document vary distinctively
across document genres). To date, however, there has been a severe lack of
genre-labelled document corpora with which researchers can experiment. It
is, therefore, with great pleasure that the Humanities Advanced Technology
and Information Institute (HATII) at the University of Glasgow and the
Digital Curation Centre (DCC) make the KRYS I Corpus available to
researchers around the globe.
The Corpus originated as part of the ongoing Semantic Metadata
Extraction research at the Digital Curation Centre
(http://www.dcc.ac.uk) and HATII at the University of Glasgow
(http://www.hatii.arts.gla.ac.uk). The metadata extraction research
evolved into a study of automated genre classification, reflecting the
observation that the genre of a document (e.g. whether it is a
scientific article or a letter) is characterised by the document's form
and structure, an understanding of which would facilitate further
extraction of metadata from within the document.
Further details about the development of the KRYS I corpus are available
via the website (http://www.krys-corpus.eu). Specifically, researchers
will find a detailed account of the document collection process, the
reclassification of the documents in the corpus, and the initial
findings with regard to human classification of the documents.
We encourage researchers to make full use of this corpus in their own
research and to consider contributing to its ongoing development by
adding their own documents to the database. Instructions on how to
contribute to the corpus are provided at http://www.krys-corpus.eu.
Comments and feedback on the KRYS I Corpus are invited. Contact
details can be found on the website. Please feel free to distribute this
announcement to any interested colleagues.
--
Yunhyong Kim
DCC Curation Resources Researcher
Humanities Advanced Technology and Information Institute (HATII)
University of Glasgow (charity number SC004401)
Glasgow
United Kingdom
Linguistic Field(s): Computational Linguistics