[Corpora-List] text categorisation - newspaper

Marina Santini (Inwind) santinim at inwind.it
Thu Jun 26 10:30:31 UTC 2003


Dear Silvia, Marco and Alessandra,

For my PhD project, I'm working on a categorization scheme
that "goes beyond topic", namely text genre categorization
on the Web.

For my master's project, I worked on the Italian corpus LE-PAROLE,
and here are two papers that may be of interest to you:

Marina Santini, Fattori per i testi, "Italiano e oltre", 2/2003,
La Nuova Italia, pp. 78-82.

Marina Santini, Text typology and statistics. Explorations in Italian
press subgenres, "Italian Journal of Linguistics/Rivista di linguistica",
Volume 13, number 2, 2001, pp. 339-374.


I will be glad to give you any further details.

Good luck

Marina Santini
PhD student at ITRI
(University of Brighton - UK)
www.itri.brighton.ac.uk



-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Silvia Bernardini
Sent: 16 June 2003 09:48
To: corpora at uib.no
Subject: [Corpora-List] text categorisation - newspaper


Dear all,

We are about to start the categorization of a corpus of Italian
newspaper text into a set of broad topics (sports, internal affairs,
arts, business, etc). We plan to follow a standard supervised machine
learning approach, tagging a subset of the corpus manually, and
following the usual train/test/classify cycle.

We would like to find information about other projects concerning the
categorization of newspaper text -- in particular, we are interested in
the topic sets that have been used in similar projects. For example, if
somebody has the list of topics used in the AP text cat collection, and
could send us a copy, that would be extremely useful.

Also, some of our prospective users are interested in a categorization
scheme that goes beyond topics, further categorizing documents across
topics into a small set of genres such as *comments* and *news*. This
seems to be a harder task, and we would be interested in work that
has addressed similar issues.

More generally, we would be grateful for any sort of advice/information
that seems relevant (e.g., pointers to other text cat work on Italian,
etc.)

Thanks a lot!

Silvia Bernardini, Marco Baroni & Alessandra Volpi
SSLMIT, University of Bologna at Forlì
Italy


