[Corpora-List] text categorisation - newspaper

Silvia Bernardini silvia at sslmit.unibo.it
Mon Jun 16 08:48:21 UTC 2003

Dear all,

We are about to start the categorization of a corpus of Italian newspaper
text into a set of broad topics (sports, internal affairs, arts, business,
etc). We plan to follow a standard supervised machine learning approach,
tagging a subset of the corpus manually, and following the usual
train/test/classify cycle.
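
As a rough illustration of the train/test/classify cycle mentioned above,
here is a minimal Python sketch. Everything in it is an assumption made for
the example: the choice of a multinomial Naive Bayes classifier, the toy
English sentences standing in for hand-tagged Italian articles, the two
topic labels, and the helper names (tokenize, train, classify, accuracy).
It is not a description of the method the project will actually use.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged subset: toy English sentences stand in for
# manually labelled Italian newspaper articles.
LABELED = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final match", "sports"),
    ("the coach praised the team after the game", "sports"),
    ("shares fell as the bank reported lower profits", "business"),
    ("the company announced record profits and new shares", "business"),
    ("markets rallied after the bank cut interest rates", "business"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(examples):
    """Collect per-topic document and word counts for Naive Bayes."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the topic with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted topic matches the manual tag."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)


# Train on part of the tagged subset, test on the held-out part,
# then classify a new, untagged document.
train_set = LABELED[:2] + LABELED[3:5]
test_set = [LABELED[2], LABELED[5]]
model = train(train_set)
held_out_accuracy = accuracy(model, test_set)
predicted = classify(model, "the bank reported new profits")
```

In a real setting the tagged subset would be far larger, the held-out split
would be drawn at random, and the tokenizer would need to handle Italian
morphology and punctuation.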

We would like to find information about other projects concerning the
categorization of newspaper text -- in particular, we are interested in
the topic sets that have been used in similar projects. For example, if
somebody has the list of topics used in the AP text cat collection, and
could send us a copy, that would be extremely useful.

Also, some of our prospective users are interested in a categorization
scheme that goes beyond topics, further categorizing documents across
topics into a small set of genres such as *comments* and *news*. This
seems to be a harder task, and we would be interested in any work that has
addressed similar issues.

More generally, we would be grateful for any sort of advice/information
that seems relevant (e.g., pointers to other text cat work on Italian,

Thanks a lot!

Silvia Bernardini, Marco Baroni & Alessandra Volpi
SSLMIT, University of Bologna at Forlì
