[Corpora-List] Typology of Internet textual genres

Marina Santini marinamailinglists at gmail.com
Sat Nov 17 13:07:35 UTC 2007


Dear Ana Rita,

unfortunately, investigations on genres on the web are still at an
early stage. We are intensively working on it, but the web is
problematic for many reasons.
We are in the phase that you describe so neatly: some documents
easily fit into a genre, and consequently some genres are easy to
identify (e.g. FAQs), while other documents are not easy to classify
in terms of genre (your links).

What shall we do in this situation if we want to annotate a corpus by genre?
Well, first of all we have to keep in mind that we are in a
transitional phase, and we must acknowledge that whatever decision we
make now might need to be revised and adjusted within months.
Then we make decisions according to our purpose. I guess your purpose
is to build a corpus for terminology extraction or analysis. So what
kind of info do you need?
A document can be annotated at many different levels.
If you have time, my suggestion is that you create a document header
to be added to your documents, containing a multi-level description.

For example, you can create markup saying:
<genre = eshop, product catalogue>
<type of document = selling page, product description>
<topic = laptop>
<domain = computing, electronics>
<purpose = informational, instructional>
etc.

If you are not sure about one or more fields, you leave them unspecified.
For example, if you do not know the genre of a web page, you can specify:
<genre = undetermined>
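
The header idea above can be sketched in a few lines of Python. This is
only an illustration, assuming the informal <field = value> notation used
in this message; the function name make_header and the fixed field list
are hypothetical, not an established annotation standard.

```python
# Hypothetical field order, copied from the example markup in this message.
FIELDS = ["genre", "type of document", "topic", "domain", "purpose"]


def make_header(annotations):
    """Build a multi-level header; fields without a value become 'undetermined'."""
    lines = []
    for field in FIELDS:
        # A missing or empty field is marked explicitly rather than dropped.
        values = annotations.get(field) or ["undetermined"]
        lines.append("<{} = {}>".format(field, ", ".join(values)))
    return "\n".join(lines)


# Example: only genre and topic are known for this (made-up) document.
header = make_header({
    "genre": ["eshop", "product catalogue"],
    "topic": ["laptop"],
})
print(header)
```

Marking unknown fields explicitly, instead of omitting them, keeps every
document header the same shape, which makes later revision easier.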

However, when you annotate a corpus, you should also make explicit the
criteria that you use. I do not know if, at the current stage of
evolution of corpus linguistics, you can just claim: in my view, this
document is an eshop. In somebody else's view, it could be a product
catalogue! You should also try to characterize your classification
categories: what is genre?, what is purpose?, what is domain?, etc.
This is what we are trying to do now, but we have not come up with any
standard yet.

A different strategic choice would be to eliminate from your
collections all documents that are difficult to classify.
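
As a rough sketch of this filtering choice, assuming each document is
stored as a dictionary with an optional "genre" entry (a hypothetical
representation, not something prescribed here), the hard cases could be
dropped like this:

```python
def drop_unclassified(corpus):
    """Keep only documents whose genre is set and not 'undetermined'."""
    return [
        doc for doc in corpus
        if doc.get("genre") not in (None, "", "undetermined")
    ]


# Made-up toy corpus: one clear case, one undetermined, one unannotated.
corpus = [
    {"id": "doc1", "genre": "faq"},
    {"id": "doc2", "genre": "undetermined"},
    {"id": "doc3"},
]
kept = drop_unclassified(corpus)
print([doc["id"] for doc in kept])
```

Note that this shrinks the corpus and may bias it towards well-established
genres, so the number of discarded documents is worth reporting.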

Another choice would be to "induce" text types (but not genres) using
Biber's multi-dimensional approach.

Well, as you can see, your question is just the tip of an iceberg.

I can suggest some basic readings, but, as we are working very hard on
it, keep an eye on future publications.

* Kilgarriff A. and Grefenstette G. (2003). "Introduction to the
Special Issue on the Web as a corpus". Computational Linguistics. Vol.
29, No. 3, pp. 333-347.

* Lee D. (2001). "Genres, Registers, Text types, Domains, and Styles:
Clarifying the concepts and navigating a path through the BNC Jungle".
Language Learning & Technology. Vol. 5, No. 3, pp. 37-72.

* Rachel Aires, Diana Santos & Sandra Aluisio: "'Yes, user!': compiling
a corpus according to what the user wants"

* Caroline Barrière & Akakpo Agbago: "Corpus Construction for Terminology"

* Serge Sharoff (University of Leeds, UK): "Classifying Web corpora into
domain and genre using automatic feature identification"

* Santini M. (2007). Characterizing Genres of Web Pages: Genre
Hybridism and Individualization, 40th Annual Hawaii International
Conference on System Sciences (HICSS'07).
etc. etc. etc. etc.

Keep us informed about your corpus and best of luck

Cheers, Marina

On 16/11/2007, Ana Rita Remígio <anaritaremigio at ua.pt> wrote:
>
>
> Dear Marina,
>
> I am looking for manual genre classifications, and existing genre collections could be of help to me.
>
> I am a Portuguese PhD student in Linguistics (Terminology) and I have a corpus with a wide variety of different texts. The written ones were fairly easy to classify, but the electronic ones are difficult. While FAQs are easy to classify, others are not, such as the ones you find on these pages (in Portuguese):
> http://www.cnamimosa.com.pt/saude_detalhe.asp?index=0&categoria=30&dossier=94
> http://www.actimel.pt/whatActimel_introduction.html
> http://iogurte.com/index.php?action=dicas&subaction=1
>
> Those are texts written by the food industry. Companies just add all kinds of info about their products to their sites, from ads to FAQs or papers. But how can I classify the examples I showed you? These are my questions right now.
>
> That is what I meant by existing genre collections. I was hoping someone had already been faced with these same problems.
>
> All the best,
> Ana
>
>
>
> ----- Original Message -----
> From: Marina Santini
> To: Ana Rita Remígio
> Cc: corpora at uib.no
> Sent: Thursday, November 15, 2007 2:30 PM
> Subject: Re: [Corpora-List] Typology of Internet textual genres
>
>
> Dear Ana Rita,
>
> are you interested in manual genre classification or automatic genre classification? Are you looking for selection and annotation criteria to be used for corpus creation and annotation, or are you interested in existing genre collections?
>
> Some web page collections annotated by genre are available from my home page at Brighton (http://www.itri.brighton.ac.uk/~Marina.Santini/). For additional collections, contact Serge Sharoff, Andrea Stubbe, Vedrana Vidulin, Cornelius Pushmann (all in English), Mirko Tavosanis (Italian blogs), Georg Rehm (German academic home pages), Alexandr Mehler (German and English), Pavel Brawslaki (Russian), etc.
>
> Mind! All existing genre annotated collections have been built with different annotation schemes and different genre palettes.
>
> It would be interesting to have a genre collection containing Portuguese web documents...
>
> Best wishes
>
> Marina
>
>
>
> On 15/11/2007, Ana Rita Remígio <anaritaremigio at ua.pt> wrote:
> >
> >
> > Hello,
> >
> > Does anyone know of papers (or any other references) on classifications of Internet textual genres (FAQs, advertisements, ...)? The goal is to classify different electronic documents taken from the Web used to build a corpus.
> >
> > Thank you in advance,
> > Ana Rita
> > _______________________________________________
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >
> >
>
>
