[Corpora-List] topic identification literature

Laurel S Stvan STVAN at uta.edu
Sun Jul 7 18:26:00 UTC 2002

Dear fellow researchers,

A colleague and I are working on a project concerning topic identification.
He's more computational and I'm more linguistic, so at first we had to
negotiate what we meant by topic.  Essentially, we are looking at ways to
abstract a given web page to see if it matches a particular topic.  We'll
have access to POS tags, frequency info, HTML code, and WordNet info.
Here's my question: Is there a widely accepted way to use these pieces of
info to identify the topic of pictures on a page, or do people each cobble
together their own identification techniques?

I'm familiar with Hovy and Lin 1999 and the material on the ACM SIGIR site,
but I'm curious whether there is any linguistic literature that is a touchstone
on web document topic identification. Leads to any relevant literature would
be appreciated. I'll be happy to post a summary.


Laurel Smith Stvan
Assistant Professor
Program in Linguistics
University of Texas at Arlington
stvan at uta.edu
