[Fwd: [Corpora-List] text categorisation - newspaper] (fwd)

Carl Lewis Sable sable at cs.columbia.edu
Mon Jun 16 15:37:27 UTC 2003


Hi,

A friend of mine forwarded your message below.  You will likely be
interested in our Newsblaster project, which is available on the web at
http://newsblaster.cs.columbia.edu.  Every night, Newsblaster
automatically crawls many popular news sites in search of what it thinks
are News articles.  It automatically clusters articles into groups such
that every article within a single group is thought to discuss the same
event.  A summary is automatically generated for each event.  Also, and I
think this directly relates to what you ask for below, each cluster of
News articles is automatically categorized into one of the categories
"U.S. News", "World News", "Entertainment", "Sports", "Finance", or
"Sci/Tech".  This was my part of the project; we use an approach I call
BINS, which can be thought of as a generalization of Naive Bayes that
computes word weights for groups of words sharing statistical features in
common (as opposed to individual words like regular Naive Bayes).  Our
accuracy is very high, I believe over 90% and maybe as high as 95%.
See for yourself!
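
To make the idea of one weight shared by a group of words concrete, here is
a minimal sketch. It is not the Newsblaster BINS code: the binning rule
(the log2 bucket of a word's training count in a category), the add-one
smoothing, and the names bin_of, train and classify are all assumptions
made for illustration only.

```python
import math
from collections import Counter, defaultdict


def bin_of(count):
    # Hypothetical binning rule (an assumption, not the BINS definition):
    # bin 0 holds words unseen in the category, bin k holds words whose
    # count falls in [2**(k-1), 2**k).
    return 0 if count == 0 else 1 + int(math.log2(count))


def train(docs):
    """docs is a list of (tokens, label) pairs."""
    counts = defaultdict(Counter)  # label -> word counts
    doc_totals = Counter()         # label -> number of documents
    for tokens, label in docs:
        counts[label].update(tokens)
        doc_totals[label] += 1
    vocab = {w for wc in counts.values() for w in wc}

    model = {}
    for label, wc in counts.items():
        bin_mass = Counter()  # bin -> summed add-one-smoothed counts
        bin_size = Counter()  # bin -> number of vocabulary words in the bin
        for w in vocab:
            b = bin_of(wc[w])
            bin_mass[b] += wc[w] + 1
            bin_size[b] += 1
        total = sum(bin_mass.values())
        # Every word in a bin shares one log-probability: the bin's mass
        # divided evenly among its words.
        weights = {b: math.log(bin_mass[b] / bin_size[b] / total)
                   for b in bin_mass}
        prior = math.log(doc_totals[label] / len(docs))
        model[label] = (prior, wc, weights)
    return model, vocab


def classify(model, vocab, tokens):
    best_label, best_score = None, -math.inf
    for label, (prior, wc, weights) in model.items():
        score = prior
        for w in tokens:
            if w in vocab:  # words outside the training vocabulary are skipped
                score += weights[bin_of(wc[w])]
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    (["goal", "match", "team"], "Sports"),
    (["match", "team", "win"], "Sports"),
    (["stocks", "market", "bank"], "Finance"),
    (["market", "shares", "bank"], "Finance"),
]
model, vocab = train(docs)
print(classify(model, vocab, ["team", "match"]))
```

With ordinary Naive Bayes each word would get its own estimate; here the
estimate is pooled over all words in the same bin, which is the only point
the sketch is meant to show.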

In addition to Newsblaster, I also created a corpus that I used for my own
work, involving news articles with embedded images from a variety of
Usenet newsgroups, and I have defined several sets of categories.  One
data set that applies to the news articles specifically involves the
categories "Politics", "Struggle", "Crime", "Disaster" or "Other", defined
to be mutually exclusive.  I hope to soon make this corpus publicly
available.  When this happens, instructions to download the corpus will be
posted at:

http://www1.cs.columbia.edu/~sable/research/corpus.html

-Carl

---------- Forwarded message ----------
Date: Mon, 16 Jun 2003 09:41:46 -0400
From: David Evans <devans at cs.columbia.edu>
To: Carl Sable <sable at cs.columbia.edu>
Subject: [Fwd: [Corpora-List] text categorisation - newspaper]

hey carl,

   are you interested in getting stuff like this?  I'm on the corpora
list, and thought you might have an interest...

dave

-------- Original Message --------
Subject: [Corpora-List] text categorisation - newspaper
Date: Mon, 16 Jun 2003 09:48:21 +0100
From: Silvia Bernardini <silvia at sslmit.unibo.it>
To: <corpora at uib.no>

Dear all,

We are about to start the categorization of a corpus of Italian newspaper
text into a set of broad topics (sports, internal affairs, arts, business,
etc). We plan to follow a standard supervised machine learning approach,
tagging a subset of the corpus manually, and following the usual
train/test/classify cycle.
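
The cycle described above can be sketched roughly as follows. This is only
an illustration under stated assumptions: the classifier is a plain
multinomial Naive Bayes with add-one smoothing, the toy documents are
invented, and the names split, train_nb, classify_nb and accuracy are
hypothetical helpers, not part of any existing toolkit.

```python
import math
import random
from collections import Counter, defaultdict


def split(labelled, test_fraction=0.25, seed=0):
    # Shuffle the manually tagged documents and hold some out for testing.
    items = list(labelled)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]


def train_nb(train_docs):
    word_counts = defaultdict(Counter)  # label -> word counts
    label_counts = Counter()            # label -> number of documents
    for tokens, label in train_docs:
        word_counts[label].update(tokens)
        label_counts[label] += 1
    vocab = {w for wc in word_counts.values() for w in wc}
    return word_counts, label_counts, vocab


def classify_nb(model, tokens):
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())

    def score(label):
        wc = word_counts[label]
        denom = sum(wc.values()) + len(vocab)  # add-one smoothing
        s = math.log(label_counts[label] / n_docs)
        for w in tokens:
            if w in vocab:  # ignore words never seen in training
                s += math.log((wc[w] + 1) / denom)
        return s

    return max(label_counts, key=score)


def accuracy(model, test_docs):
    correct = sum(classify_nb(model, tokens) == label
                  for tokens, label in test_docs)
    return correct / len(test_docs)


# Invented toy data standing in for the manually tagged subset.
labelled = [
    (["calcio", "partita", "gol"], "sports"),
    (["partita", "squadra", "gol"], "sports"),
    (["squadra", "calcio", "allenatore"], "sports"),
    (["campionato", "partita", "calcio"], "sports"),
    (["borsa", "mercato", "azioni"], "business"),
    (["mercato", "banca", "azioni"], "business"),
    (["banca", "borsa", "euro"], "business"),
    (["euro", "mercato", "borsa"], "business"),
]
train_docs, test_docs = split(labelled)
model = train_nb(train_docs)
print("held-out accuracy:", accuracy(model, test_docs))

# Once the held-out accuracy is acceptable, retrain on everything and
# classify the untagged remainder of the corpus.
final_model = train_nb(labelled)
print(classify_nb(final_model, ["calcio", "partita"]))
```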

We would like to find information about other projects concerning the
categorization of newspaper text -- in particular, we are interested in
the topic sets that have been used in similar projects. For example, if
somebody has the list of topics used in the AP text cat collection, and
could send us a copy, that would be extremely useful.

Also, some of our prospective users are interested in a categorization
scheme that goes beyond topics, further categorizing documents across
topics into a small set of genres such as *comments* and *news*. This
seems to be a harder task, and we would be interested in work that pursued
similar issues.

More generally, we would be grateful for any sort of advice/information
that seems relevant (e.g., pointers to other text cat work on Italian,
etc.)

Thanks a lot!

Silvia Bernardini, Marco Baroni & Alessandra Volpi
SSLMIT, University of Bologna at Forli'
Italy