[Corpora-List] Call for Interest in Participation: Colloquium at Corpus Linguistics 2007 (Birmingham, UK)

Wed Jun 20 15:47:37 UTC 2007

Call for Interest in Participation:

"Towards a Reference Corpus of Web Genres"

Colloquium held in conjunction with Corpus Linguistics 2007 

Organizers: Marina Santini and Serge Sharoff 

Birmingham, UK, July 27, 2007

Colloquium website: http://corpus.leeds.ac.uk/serge/webgenres/ 
Colloquium schedule: http://corpus.leeds.ac.uk/serge/webgenres/schedule.html
Corpus Linguistics 2007 website: http://www.corpus.bham.ac.uk/conference2007  

Description 

Genres of spoken and written texts are being intensively studied from various angles, e.g., communication studies, discourse analysis, computational linguistics, without arriving at a generally accepted definition. Many corpora have been built to represent the language, but very few large corpora indicate genres, and when they do the typology of genres varies widely. For instance, the Brown corpus famously uses 15 textual categories, from press reportage (a text genre) to religion or skills and hobbies (domains), while the British National Corpus (BNC) uses 70 classes, such as academic or non-academic scientific texts or biography. Interestingly, genre classes in the BNC are an add-on proposed by David Lee (Lee, 2001) after the corpus construction, rather than a basic criterion of the corpus creation. The genre attribute was included in a few collections used in information retrieval (TREC HARD 2003 and 2004, or TREC-2006 Blog Track), but the set of genres proposed was either debatable (e.g. the ‘reaction’ genre in TREC HARD 2003), or limited to a single genre (e.g. the blog genre in TREC-2006 Blog Track).

The web is new, so it is even less clear how to apply traditional notions of genre to web documents. In corpus-based genre studies, the main tendency has been to build one's own genre collection according to subjective criteria for corpus composition, genre annotation, and genre granularity. Genre annotation has been based either on the common sense of a single rater, or on the agreement of a few annotators. In brief, as things stand, web genre analyses remain self-contained and corpus-dependent.

Building a reference corpus of web genres is certainly difficult because web documents are often characterised by a high level of genre hybridism, by a fragmentation of textuality across several documents, and by the impact of technical features such as hyperlinking, posting facilities and multi-authoring. Since the web is a huge reservoir of documents that can be easily mined for building all sorts of corpora, it is important to overcome the subjectivity that characterizes genre-related issues, in order to create sharable resources. What should we consider when designing a reference corpus of web genres? Genres of web documents show some traits that are not accounted for in TREC collections or in the BNC and that are, instead, important on the web. For example:

* Genre Hybridism and Individualization 
The fluidity and fast-paced dynamism of the web, together with the complexity of web pages, blur genre conventions and favour genre mixture and authorial creativity. These two phenomena appear to be very common on the web.

* Granularity of the Unit of Analysis 
How many granularities of the unit of analysis should be included? Only genres representing web sites? Only genres representing web pages? Both?

* Format of Web Documents 
An issue related to the previous one is the 'format' that should be used to store the 'units of analysis' in a collection. In what form can a web page or a website be included in a corpus? In HTML format or in a text-only version? Including images or leaving them out? Removing boilerplate or keeping it? In a database-like form, as DOM trees, or as a net of graphs?

* Genre Granularity and Similarity 
Genres can be accounted for at subgenre, genre and super-genre level: what level of genre granularity should be applied in the reference corpus? Furthermore, should similar genres, such as TUTORIAL and HOW-TO, be accounted for separately? 

* How to build a Genre Palette 
How many and which genres should be included in a genre reference corpus? 

* Validation and Evaluation of a Reference Corpus of Web Genres 
How can we validate and evaluate the quality of a genre corpus? 

Rationale for the Colloquium 

The rationale for this colloquium is to draw up an initial list of characteristics and requirements for building, annotating and evaluating reference corpora of web genres. 

Building a genre-annotated reference corpus of web pages is arduous for a number of reasons, and several solutions appear to be viable. In this colloquium, we would like to make a first attempt to apply the concept of genre to the development of sharable criteria for building genre corpora. 

The ambition of this colloquium, the first ever organized on this topic, is to bring together researchers from different communities such as corpus linguistics, genre analysis, digital genre community, computational linguistics, and information retrieval in order to promote the discussion and development of new ideas and methods to create new corpora for language studies and as evaluation resources.   

SHORT ABSTRACTS 

Alexander Mehler: A Corpus Model of Structure Formation in Hypertext Types
This paper describes a web genre corpus model. Its starting point is a graph model of the logical document structure of hypertext types and of the linkage of their constituents. We describe an XML-based serialization of this model and provide a database mapping which retains a wide range of web genre data. This will be exemplified by three web genres. 

Barbara H. Kwasnik, Kevin Crowston, Joseph Rubleske and You-Lee Chun: Building a Corpus of Genre-Tagged Webpages for an Information-Access Experiment
This presentation reports on one phase of a larger study whose overarching aim is to determine how providing genre metadata can help in access to sources of information in a digital environment. We have built a corpus of genre-tagged web pages and structured this particular experimental corpus in such a way as to provide the maximum control for our experiments. We recognize, however, that much rich genre information was either too difficult to represent or had to be pared away.

Serge Sharoff: In the garden and in the jungle: comparing genres in the BNC and Internet
According to Adam Kilgarriff, the BNC is a jungle when compared to smaller Brown-type corpora, but it looks more like an English garden when compared to the Internet. In this presentation I will compare English and Russian Internet corpora against their human-collected counterparts (BNC and RNC) using two methods: the first involves manual annotation of a subset of Internet corpora, the second uses probabilistic classifiers. The study shows that the Internet is not radically different from the BNC: Internet corpora do contain a wide range of genres and approximate many genres that exist in printed form. The same is true for the audience level (texts for professionals or for laymen).

Mark Rosso: Development of a Genre Palette
This presentation details the development of a genre palette used in the study of the effects of genre-annotated search results on the relevance judgement process in a web search environment. This palette development was conducted in several phases: (i) a survey of user terminology; (ii) user-based refinement of terminology into a tentative genre palette, and (iii) user validation of the genre palette. 

Andrea Stubbe and Christoph Ringlstetter: Recognizing Genres
We introduce a two-level hierarchy of genres based on the definition of genre in terms of form and function (or purpose). Thereby we provide sufficient granularity with the possibility to return to a coarser scheme when preferable. As some texts may naturally fall into more than one genre, an assignment to multiple classes is possible. For those applications where a unique class is required, several techniques for the combination of classifiers were evaluated.

Andrea Stubbe, Christoph Ringlstetter, Tong Zheng, and Randy Goebe: Incremental genre classification
In this presentation we will describe attempts to acquire data. These attempts have to consider the users explicitly and cooperatively. The user behaviour will be simulated using annotated corpus data. We will also formulate different scenarios for information gain representing different levels of uncertainty. Our goal is to integrate existing material of different sources into a realistic application. 

Cornelius Puschmann: SchemaCMD: An XML-based storage schema for the compilation of mixed-source CMD corpora
This presentation will outline an XML schema for the segmentation and storage of data from Internet sources, specifically those which utilize so-called web feeds (often associated with the RSS protocol). It is based on the faceted classification scheme recently proposed by Susan Herring and aims to make data from diverse sources accessible and comparable in a single format.

Registration 

Information on registration and registration fees is provided at the CL2007 website: http://www.corpus.bham.ac.uk/conference2007

Date and Location 
The colloquium is scheduled for Friday, 27 July 2007, from 13:45 to 17:30
Corpus Linguistics 2007 Venue: University of Birmingham, Birmingham, UK

Programme Committee 

Marco Baroni (University of Trento, Italy) 
Stefan Gries (University of California, USA) 
Adam Kilgarriff (Lexmasterclass, UK) 
Alexander Mehler (Bielefeld University, Germany) 
Sven Meyer zu Eissen (University of Weimar, Germany) 
Paul Rayson (UCREL, Lancaster University, UK) 
Georg Rehm (University of Tuebingen, Germany) 
Marina Santini (University of Brighton, UK) 
Serge Sharoff (University of Leeds, UK) 
Benno Stein (University of Weimar, Germany) 

Organizing Committee 

Marina Santini (University of Brighton, UK) 
Email: MarinaSantini.MS at gmail.com 
Personal Home Page: http://www.nltg.brighton.ac.uk/home/Marina.Santini/ 

Serge Sharoff (University of Leeds, UK) 
Email: s.sharoff at leeds.ac.uk 
Personal Home Page: http://corpus.leeds.ac.uk/serge/ 

Contact Information 
For questions or comments, please contact Marina Santini (MarinaSantini.MS at gmail.com), or Serge Sharoff (s.sharoff at leeds.ac.uk).