[Corpora-List] text categorisation - newspaper

Jose Maria Gomez Hidalgo jmgomez at dinar.esi.uem.es
Mon Jun 16 08:55:04 UTC 2003


At 09:48 16/06/2003 +0100, Silvia Bernardini wrote:
>Dear all,
>
>We are about to start the categorization of a corpus of Italian newspaper
>text into a set of broad topics (sports, internal affairs, arts, business,
>etc). We plan to follow a standard supervised machine learning approach,
>tagging a subset of the corpus manually, and following the usual
>train/test/classify cycle.
>
>We would like to find information about other projects concerning the
>categorization of newspaper text -- in particular, we are interested in
>the topic sets that have been used in similar projects. For example, if
>somebody has the list of topics used in the AP text cat collection, and
>could send us a copy, that would be extremely useful.

A European news categorization project was NAMIC
(http://www.dcs.shef.ac.uk/nlp/namic/).

Text categorization test collections relevant to your problem (in English) are:
* Reuters-21578
(http://www.daviddlewis.com/resources/testcollections/reuters21578/)
* Reuters Corpus, Volume 1
(http://about.reuters.com/researchandstandards/corpus/) (I recommend this
one; it is much bigger and more challenging).
You can take the topic sets from them.
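
As a rough illustration of getting a topic list out of such a collection,
here is a minimal Python sketch. It assumes that topic labels are marked up
as <TOPICS><D>label</D></TOPICS>, as in the Reuters-21578 SGML files; the
sample string and the function name topic_counts are made up for
illustration, so check the layout against the actual distribution first.

```python
import re
from collections import Counter

# Made-up two-document sample imitating the assumed SGML layout.
# In practice you would read the .sgm files of the distribution instead.
SAMPLE = """
<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT><BODY>Placeholder text about cocoa.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="YES" NEWID="2">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT><BODY>Placeholder text about grain and wheat.</BODY></TEXT>
</REUTERS>
"""

# One <TOPICS> element per document, holding zero or more <D> labels.
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
LABEL_RE = re.compile(r"<D>(.*?)</D>")


def topic_counts(sgml):
    """Count how many documents carry each topic label (hypothetical helper)."""
    counts = Counter()
    for block in TOPICS_RE.findall(sgml):
        counts.update(LABEL_RE.findall(block))
    return counts


if __name__ == "__main__":
    # Print the topic inventory, most frequent labels first.
    for topic, n in topic_counts(SAMPLE).most_common():
        print(topic, n)
```

The resulting label inventory, with its frequencies, can then be pruned or
merged into the broad topics you want for the Italian corpus.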

You can also use the sections of the newspapers themselves as topics.

For information on text categorization (TC) and resources for Italian,
contact the Istituto di Linguistica Computazionale - Consiglio Nazionale
Ricerche (http://www.ilc.cnr.it/indexflash.html) and Fabrizio Sebastiani
(http://faure.iei.pi.cnr.it/~fabrizio/) of the Istituto di Scienza e
Tecnologia dell'Informazione - Consiglio Nazionale Ricerche
(http://www.iei.pi.cnr.it/).


>Also, some of our prospective users are interested in a categorization
>scheme that goes beyond topics, further categorizing documents across
>topics into a small set of genres such as *comments* and *news*. This
>seems to be a harder task, and we would be interested in work that pursued
>similar issues.
>
>More in general, we would be grateful for any sort of advice/information
>that seems relevant (e.g., pointers to other text cat work on Italian,
>etc.)
>
>Thanks a lot!
>
>Silvia Bernardini, Marco Baroni & Alessandra Volpi
>SSLMIT, University of Bologna at Forli'
>Italy



_______________________________________________________________________________

Jose Maria Gomez Hidalgo
Departamento de Inteligencia Artificial
Universidad Europea de Madrid
28670 - Villaviciosa de Odon - MADRID
(+34) 912115670
jmgomez at dinar.esi.uem.es
http://www.esi.uem.es/~jmgomez/
_______________________________________________________________________________

Spanish law guarantees privacy in electronic communications. This 
electronic transmission is strictly confidential and intended solely for 
the addressee. If you are not the intended addressee, you are kindly 
requested not to disclose nor to copy this transmission and to notify us as 
soon as possible.
