[Corpora-List] Resources concerning multilabel problem

Ralf Steinberger ralf.steinberger at jrc.it
Fri Aug 18 12:32:18 UTC 2006


Dear Cecilie,

We have recently made available the JRC-Acquis corpus, a multilingual
document collection (21 languages) that is multi-labelled according to
the Eurovoc thesaurus and aligned at paragraph level for each of the 210
language pairs. You can download it from:

      http://langtech.jrc.it/JRC-Acquis.html

Furthermore, in the 'Publications' section of our web site
(http://langtech.jrc.it/#Publications), you will find a number of papers on
(typically multilingual) multi-label text categorisation applications,
mainly from the years 2002-2004, including the following:

Pouliquen Bruno, Ralf Steinberger & Camelia Ignat (2003). Automatic
Annotation of Multilingual Text Collections with a Conceptual Thesaurus.
In: Proceedings of the Workshop Ontologies and Information Extraction at the
Summer School The Semantic Web and Language Technology - Its Potential and
Practicalities (EUROLAN'2003). Bucharest, Romania, 28 July - 8 August 2003.
<http://langtech.jrc.it/Documents/EuroLan-03_Pouliquen-Steinberger-et-al.pdf>

The text categorisation approach described in that paper is the main
component of our daily news analysis system NewsExplorer (freely
accessible at http://press.jrc.it/NewsExplorer), where it is used to link
related news across languages.

I hope this helps. All the best,

Ralf

Ralf Steinberger (Ralf.Steinberger at jrc.it)
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it,
http://press.jrc.it/NewsExplorer)
T.P. 267, Via Fermi 1
21020 Ispra (VA), Italy
Tel: +39 0332 78-6271
Fax: +39 0332 78-5154
Secretary: +39 0332 78-5648 or 9478

New URL: http://langtech.jrc.it. The previous address
http://www.jrc.it/langtech will only be valid for a few more months.

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Cecilie Desiree Widsteen
Sent: 18 August 2006 11:09
To: Corpora list
Subject: [Corpora-List] Resources concerning multilabel problem

Hello all!

I am looking for resources (articles, books, webpages) concerning the
multilabel (multiclass?) problem in the context of text classification.
By this I mean the fact that a document can be classified into more than
one category. Especially w.r.t. supervised learning algorithms, where
the documents in the training set may belong to multiple classes.

Regards,
--
Cecilie Widsteen
Institute for Informatics,
University of Oslo
