19.3604, FYI: KRYS I Corpus for Genre Classification Research

Mon Nov 24 19:34:30 UTC 2008

LINGUIST List: Vol-19-3604. Mon Nov 24 2008. ISSN: 1068 - 4875.

Subject: 19.3604, FYI: KRYS I Corpus for Genre Classification Research

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Randall Eggert, U of Utah  
         <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Matthew Lahrman <matt at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 20-Nov-2008
From: Yunhyong Kim < y.kim at hatii.arts.gla.ac.uk >
Subject: KRYS I Corpus for Genre Classification Research

-------------------------Message 1 ---------------------------------- 
Date: Mon, 24 Nov 2008 14:33:05
From: Yunhyong Kim [y.kim at hatii.arts.gla.ac.uk]
Subject: KRYS I Corpus for Genre Classification Research

The Humanities Advanced Technology and Information Institute (HATII) at 
the University of Glasgow and the Digital Curation Centre (DCC) are 
delighted to announce the release of the KRYS I Corpus for genre 
classification research.

http://www.krys-corpus.eu

The corpus, consisting of 6434 documents labelled with document genres, 
is expected to become a major research resource among text processing 
and data and information management researchers. In particular, we 
encourage the use of the corpus for research in:

- Automated Text Classification (TC)
- Digital curation and metadata extraction
- Natural Language Processing (NLP)
- Computational Linguistics (CL)

Despite the potential of document genre classification as a supporting 
step in language processing, document management, and information 
retrieval (e.g. the linguistic style and the vocabulary of a document 
vary distinctively across document genres), to date, there has been a 
severe lack of genre-labelled document corpora with which researchers 
can experiment. It is, therefore, with great pleasure that the 
Humanities Advanced Technology and Information Institute (HATII) at the 
University of Glasgow and the Digital Curation Centre (DCC) make the 
KRYS I Corpus available to researchers around the globe.

The Corpus originated as part of the ongoing Semantic Metadata 
Extraction research at the Digital Curation Centre 
(http://www.dcc.ac.uk) and the HATII at the University of Glasgow 
(http://www.hatii.arts.gla.ac.uk). The metadata extraction research 
evolved into a study of automated genre classification, reflecting the 
observation that the genre of a document (e.g. whether a document is a 
scientific article or a letter) is characterised by the form and 
structure of that document, the understanding of which would facilitate 
further extraction of metadata from within the document.

Further details about the development of the KRYS I corpus are available 
via the website (http://www.krys-corpus.eu). Specifically, researchers 
will find a detailed account of the document collection process, the 
reclassification of the documents in the corpus, and the initial 
findings with regard to human classification of the documents.

We encourage researchers to make full use of this corpus in their own 
research and to consider contributing to its ongoing development by 
adding their own documents to the database. Instructions on how to 
contribute to the corpus are provided at http://www.krys-corpus.eu.

Comments and/or feedback on the KRYS I Corpus are invited. Contact 
details can be found on the website. Please feel free to distribute this 
announcement to any interested colleagues.

-- 
Yunhyong Kim
DCC Curation Resources Researcher
Humanities Advanced Technology and Information Institute (HATII)
University of Glasgow (charity number SC004401)
Glasgow
United Kingdom 

Linguistic Field(s): Computational Linguistics

-----------------------------------------------------------
LINGUIST List: Vol-19-3604