18.2637, Diss: Computational Ling: Conway: 'Approaches to Automatic Biograph...'

LINGUIST Network linguist at LINGUISTLIST.ORG
Mon Sep 10 18:54:26 UTC 2007


LINGUIST List: Vol-18-2637. Mon Sep 10 2007. ISSN: 1068 - 4875.

Subject: 18.2637, Diss: Computational Ling: Conway: 'Approaches to Automatic Biograph...'

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Randall Eggert, U of Utah  
         <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Hunter Lockwood <hunter at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 09-Sep-2007
From: Mike Conway < mike at nii.ac.jp >
Subject: Approaches to Automatic Biographical Sentence Classification: An empirical study

 

	
-------------------------Message 1 ---------------------------------- 
Date: Mon, 10 Sep 2007 14:53:19
From: Mike Conway [mike at nii.ac.jp]
Subject: Approaches to Automatic Biographical Sentence Classification: An empirical study


Institution: University of Sheffield 
Program: Department of Computer Science 
Dissertation Status: Completed 
Degree Date: 2007 

Author: Mike Conway

Dissertation Title: Approaches to Automatic Biographical Sentence
Classification: An empirical study 

Linguistic Field(s): Computational Linguistics


Dissertation Director(s):
Robert Gaizauskas

Dissertation Abstract:

This thesis addresses the problem of reliably identifying biographical
sentences, an important subtask in several natural language processing
application areas (for example, biographical multi-document summarisation
and biographical information extraction). The biographical sentence
classification task is placed within the framework of genre classification,
rather than traditional topic-based text classification.

Before exploring methods for doing this task computationally, we need to
establish whether, and with what degree of success, humans can identify
biographical sentences without the aid of discourse or document structure.
To this end, a biographical annotation scheme and corpus were developed and
assessed using a human study. The human study showed that participants were
able to identify biographical sentences with a good level of agreement.
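
As an illustration of how agreement between two annotators on such a task
might be quantified, the sketch below computes Cohen's kappa for two
sequences of sentence labels. The abstract does not say which agreement
statistic the thesis uses, so the choice of kappa is an assumption, and the
function name, the labels "bio" and "non", and the toy annotations are all
hypothetical.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative only: the thesis abstract does not name its agreement
    measure, so kappa is an assumed stand-in.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: "bio" = biographical, "non" = non-biographical.
annotator_1 = ["bio", "bio", "non", "non", "bio", "non"]
annotator_2 = ["bio", "non", "non", "non", "bio", "non"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which
matters when one label is much more frequent than the other.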

The main body of the thesis presents a series of experiments designed to
identify, from a range of alternatives, the best sentence representations
for the automatic identification of biographical sentences. In contrast to
previous work, which has centred on the use of single terms (that is,
unigrams) for biographical sentence representations, the current work
derives unigram, bigram and trigram features from a large corpus of
biographical text (including the British Dictionary of National Biography).
In addition to the use of corpus-derived n-grams, a novel characteristic of
the current approach is the use of biographically relevant syntactic
features, identified both intuitively and through empirical methods.
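
To make the unigram, bigram and trigram representation concrete, here is a
minimal sketch of n-gram counting for a single sentence. It is not the
thesis's implementation: the whitespace tokeniser, the function names and
the example sentence are assumptions chosen for brevity.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation.

    A deliberately naive tokeniser, assumed here for illustration.
    """
    tokens = []
    for raw in sentence.lower().split():
        token = raw.strip(".,;:!?()\"'")
        if token:
            tokens.append(token)
    return tokens


def ngram_counts(sentence, max_n=3):
    """Count all n-grams of length 1..max_n in one sentence."""
    tokens = tokenize(sentence)
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical example sentence.
features = ngram_counts("He was born in Sheffield in 1950.")
```

Counts of this kind could serve as a sentence's feature vector, with the
vocabulary restricted to n-grams selected from a reference corpus.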

The experimental work shows that a combination of n-gram features derived
from the Dictionary of National Biography and biographically orientated
syntactic features yields a performance that surpasses that gained using
n-gram features alone. Additionally, in accordance with the view of
biographical sentence classification as a genre classification task,
stylistic features (for example, topic-neutral function words) are shown to
be important for recognising biographical sentences.




