18.1436, Diss: Computational Ling/Text&Corpus Ling: Santini: 'Automatic Iden...'

Fri May 11 19:31:58 UTC 2007

LINGUIST List: Vol-18-1436. Fri May 11 2007. ISSN: 1068 - 4875.

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Laura Welcher, Rosetta Project  
       <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Hunter Lockwood <hunter at linguistlist.org>
================================================================  

===========================Directory==============================  

1)
Date: 11-May-2007
From: Marina Santini < MarinaSantini.MS at gmail.com >
Subject: Automatic Identification of Genre in Web Pages

-------------------------Message 1 ---------------------------------- 
Date: Fri, 11 May 2007 15:30:20
From: Marina Santini < MarinaSantini.MS at gmail.com >
Subject: Automatic Identification of Genre in Web Pages 

Institution: University of Brighton 
Program: Computational Linguistics 
Dissertation Status: Completed 
Degree Date: 2007 

Author: Marina Santini

Dissertation Title: Automatic Identification of Genre in Web Pages 

Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics

Dissertation Director(s):
Roger Evans
Michael Oakes
Lyn Pemberton
Richard Power

Dissertation Abstract:

The aim of this thesis is to understand how genre is instantiated on the
web, and thereby to develop automatic methods for genre identification in
web pages. The main challenges arise from the interaction of three factors:
(1) the complexity of web pages, (2) the fluidity and the fast-paced
evolution of the web, and (3) the limitations of automatically-extractable
features for genre detection. First, genres on the web are instantiated in
web pages, which, from a physical, linguistic and textual point of view,
can be considered documents of a new type, much more unpredictable and
individualised than documents on paper. Second, the web is unstable and
fluid, undergoing a fast-paced evolution, so genre identification is
influenced by phenomena such as the formation of novel genres, genre
hybridism, individualisation, and intra-genre and inter-genre variation.
Finally, automatically-extractable features represent a poor surrogate for
potentially useful genre-revealing features. These three factors strongly
affect the automatic identification of genre in web pages. Previous work
has disregarded them for the sake of practicality, and built on the
oversimplifying assumption that a web page is to be assigned to only one
genre, relying as little as possible on the linguistic features returned by
NLP tools. By contrast, this thesis argues for the necessity of a more
flexible genre classification scheme, capable of assigning zero, one or
multiple genre labels, and builds as much as possible on the output of NLP
tools. A series of empirical studies is presented that investigates (i) why
a zero-to-multi genre classification scheme would be more appropriate for
classifying web pages, and (ii) to what extent it is possible to implement
this scheme in an automatic system. A new model of zero-to-multi-genre
classification is presented that combines several traditions, incorporating
findings from automatic genre classification, corpus linguistics, genre
analysis, textlinguistics and artificial intelligence. This model offers a
more articulated view of genres in web pages. Although such a model cannot
be fully evaluated, given the limitations of the current state of genre
research, experimental results show that its accuracy on single-genre
classification is competitive with that of a standard machine-learning
model: about 86% vs. 90% in ideal conditions, and about 86% vs. 76% in
more realistic conditions.
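
As an illustration only, the idea of assigning zero, one or multiple genre
labels can be sketched as a per-genre threshold over confidence scores. This
is not the model described in the thesis: the function name, the genre
labels, the scores and the threshold value below are all hypothetical,
chosen just to show how one page may receive no label, one label or several.

```python
from typing import Dict, List


def assign_genres(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every genre whose score reaches the threshold.

    The result may be empty (zero genres), hold one label, or hold several.
    Labels are returned in alphabetical order for reproducibility.
    """
    return sorted(genre for genre, score in scores.items() if score >= threshold)


# Hypothetical per-genre confidence scores for three web pages.
pages = {
    "page_a": {"blog": 0.82, "faq": 0.10, "eshop": 0.05},  # one genre
    "page_b": {"blog": 0.61, "faq": 0.07, "eshop": 0.58},  # hybrid page
    "page_c": {"blog": 0.20, "faq": 0.15, "eshop": 0.12},  # no known genre
}

for name, scores in pages.items():
    print(name, assign_genres(scores))
```

A single-label classifier would instead be forced to pick the top-scoring
genre for every page, including hybrid pages and pages that fit none of the
known genres.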
