[Corpora-List] text categorisation - newspaper

Luisa Bentivogli bentivo at itc.it
Mon Jun 23 14:27:34 UTC 2003


Silvia Bernardini wrote:

> We would like to find information about other projects concerning the
> categorization of newspaper text -- in particular, we are interested in
> the topic sets that have been used in similar projects. For example, if
> somebody has the list of topics used in the AP text cat collection, and
> could send us a copy, that would be extremely useful.

Here at ITC-irst we are creating the MEANING Italian Corpus (MIC), a
150-million-word corpus of contemporary written Italian developed to support
domain-based Word Sense Disambiguation. The MIC is composed of newspaper
articles, press agency news, and web documents. Its novelty is that
domain-representativeness is the fundamental criterion for text selection.

The topic set used is that of WordNet-Domains. WN-Domains is an extension of
WordNet 1.6 in which each synset has been annotated with at least one domain
label, selected from a hierarchically organized set of 164 labels. WN-Domains
is currently used within the Natural Language Processing community for
different tasks, such as word sense disambiguation and text categorization.
The WN-Domains hierarchy was created starting from the subject field codes
used by current dictionaries and from the Dewey Decimal Classification system
(DDC), which is the most widely used library classification system in the
world and provides a very large and complete set of hierarchically structured
domain labels.
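
As a rough illustration of how a domain-labelled lexicon can be used for text
categorization, here is a minimal Python sketch. Everything in it is an
assumption made for the example: the toy lexicon, its labels, and the
function name rank_domains are hypothetical and are not taken from the
WN-Domains files. The real resource annotates WordNet 1.6 synsets rather
than bare word forms, so a real system would also need lemmatization and
sense handling.

```python
from collections import Counter

# Hypothetical toy lexicon mapping word forms to domain labels.
# Illustrative only: WN-Domains itself labels WordNet 1.6 synsets,
# and these entries are not copied from the resource.
TOY_DOMAINS = {
    "goalkeeper": ["sport"],
    "match": ["sport"],
    "league": ["sport"],
    "bank": ["economy", "geography"],
    "interest": ["economy"],
    "loan": ["economy"],
}


def rank_domains(text):
    """Count one vote per domain label for each known word in the text."""
    votes = Counter()
    for token in text.lower().split():
        # Strip surrounding punctuation so "loan." matches "loan".
        word = token.strip(".,;:!?\"'()")
        for label in TOY_DOMAINS.get(word, []):
            votes[label] += 1
    # Domains sorted from most to least votes.
    return votes.most_common()


print(rank_domains("The bank raised the interest on every loan."))
```

An ambiguous word such as "bank" votes for all of its domains, and the
other words in the text decide which domain ends up on top.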

A core set of 42 basic domains (the second level of the WN-Domains hierarchy)
has been chosen to be represented in the MIC. The list of domains can be
found at
http://tcc.itc.it/research/textec/topics/acquisition-resources/WN-DOMAINS.txt

For more information about WN-Domains, you can visit
http://wndomains.itc.it/

You might also be interested in the NERC report (see EAGLES Recommendations
on Text Typology at
http://www.ilc.cnr.it/EAGLES96/texttyp/node37.html),
which offers a summary of the classification systems used by major corpus
projects in Europe. The MIC is in line with the European trend in corpus
practice, as most of the commonly used topics reported in that document
correspond to our basic domains.

All the best,

Luisa Bentivogli

--
Luisa Bentivogli -  bentivo at itc.it
Centro per la Ricerca Scientifica e Tecnologica
Via Sommarive, 18  38050 Povo - Trento ITALY
Tel: +39-0461-314-574  Fax: +39-0461-302-040
http://tcc.itc.it/people/bentivogli.html


