[Corpora-List] corpus ------>>>>> thesaurus

Rob Koeling robk at sussex.ac.uk
Fri Nov 12 17:07:35 UTC 2004


Hello Vladimir,

I am working on creating domain specific thesauruses at the moment.
Creating these thesauruses is not a goal in itself, but a means to
create domain specific rankings of word senses. I am working on ranking
word senses with Diana McCarthy, John Carroll and Julie Weeds. You can
read more on this in our ACL-2004 paper.

In this paper we describe an experiment with specific sense rankings for
the Sports and Finance domains. We created one corpus of Sports texts and
one of Finance texts using the Reuters corpus. We used all the documents
in the Reuters corpus with a Sports label (topic code GSPO) and, I think,
about a third of the Finance related texts (topic codes ECAT and MCAT).
The resulting corpora were 9.1 million words and 32.5 million words
respectively. We created the thesauruses using Lin's method. You can
find the details of how we created the thesauruses in the paper. I'm not
sure if we can distribute the resulting thesauruses. I'll have to look
at the Reuters license.
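
For readers unfamiliar with the approach, here is a minimal sketch of how
a thesaurus of this kind can be built. It assumes that "Lin's method"
means the distributional similarity measure of Lin (1998), computed over
(word, relation, co-word) dependency triples taken from a parsed corpus.
The function names, the relation label and the toy triples are all
hypothetical illustrations and are not taken from the paper, which
remains the reference for the actual setup.

```python
import math
from collections import defaultdict


def build_features(triples):
    """Map each word to its (relation, co_word) features, weighted by
    pointwise mutual information. Only positive weights are kept."""
    count = defaultdict(int)   # ||w, r, w'||
    w_r = defaultdict(int)     # ||w, r, *||
    r_w2 = defaultdict(int)    # ||*, r, w'||
    r_tot = defaultdict(int)   # ||*, r, *||
    for w, r, w2 in triples:
        count[(w, r, w2)] += 1
        w_r[(w, r)] += 1
        r_w2[(r, w2)] += 1
        r_tot[r] += 1

    info = defaultdict(dict)
    for (w, r, w2), c in count.items():
        mi = math.log((c * r_tot[r]) / (w_r[(w, r)] * r_w2[(r, w2)]))
        if mi > 0:
            info[w][(r, w2)] = mi
    return info


def lin_similarity(info, w1, w2):
    """Information in the shared features divided by the total
    information in the features of both words."""
    f1, f2 = info.get(w1, {}), info.get(w2, {})
    denom = sum(f1.values()) + sum(f2.values())
    if denom == 0:
        return 0.0
    shared = f1.keys() & f2.keys()
    return sum(f1[f] + f2[f] for f in shared) / denom


def nearest_neighbours(info, word, k=5):
    """A thesaurus entry: the k words most similar to `word`."""
    scored = [(other, lin_similarity(info, word, other))
              for other in info if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy triples (hypothetical); a real run would use parser output
# from the domain corpus.
toy_triples = [
    ("goal", "obj-of", "score"),
    ("penalty", "obj-of", "score"),
    ("goal", "obj-of", "concede"),
    ("share", "obj-of", "buy"),
    ("bond", "obj-of", "buy"),
    ("share", "obj-of", "sell"),
]
toy_info = build_features(toy_triples)
print(nearest_neighbours(toy_info, "goal", k=2))
```

Running the same procedure separately on the Sports corpus and on the
Finance corpus would give a word different neighbours in each, which is
what makes the thesaurus domain specific.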

The articles in the Reuters corpus are hand-tagged, so the resulting
domain specific corpora should be of high quality. Unfortunately, there is
very little hand-annotated data available. At the moment I am setting up
an experiment to harvest texts from the web in order to create domain
specific corpora. We have selected some 40 different domains (from the
Subject Field Codes hierarchy, see ref. in ACL paper) and created a text
classifier for these domains. These corpora will be used to create
domain specific thesauruses. We want to use these thesauruses to create
specific word sense rankings for all these domains.

I can't say anything yet about how high the quality of these domain
corpora will be. I hope to be able to say more about this in a couple of
months. I don't see any reason why we wouldn't be able to share these
thesauruses.

Best,  

  - Rob Koeling



On Tue, 9 Nov 2004, Rykov V.V. (Moscow) wrote:

> 
>     I would be very grateful to anyone for any info concerning
> compiling thesaurus from corpus (esp. from corpus of specific domain
> documents).
> 
>     As example - thesaurus of financial terms compiled from financial
> documents corpus.
> 
>       Best wishes to all our corpus society !
> 
> -- 
>   Regards Vladimir Rykov
> 
> PhD in Computational Linguistics
> Personal web-site: rykov.narod.ru  
> mailto: rykov2000 at mail.ru  
> Si etiam omnes - ego non
> English version:   www.blkbox.com/~gigawatt/rykov.html
> 


