[Corpora-List] Resources concerning multilabel problem
Ralf Steinberger
ralf.steinberger at jrc.it
Fri Aug 18 12:32:18 UTC 2006
Dear Cecilie,
We have recently made available the JRC-Acquis corpus, which is a
multilingual (21 languages) document collection multi-labelled according to
the Eurovoc thesaurus and aligned at paragraph level for each of the 210
language pairs. You can download it from:
http://langtech.jrc.it/JRC-Acquis.html
Furthermore, in the 'Publications' section of our web site
(http://langtech.jrc.it/#Publications), you will find a number of papers on
(typically multilingual) multi-label text categorisation applications (mainly
from the years 2002-2004), including the following:
Pouliquen Bruno, Ralf Steinberger & Camelia Ignat (2003). Automatic
Annotation of Multilingual Text Collections with a Conceptual Thesaurus
<http://langtech.jrc.it/Documents/EuroLan-03_Pouliquen-Steinberger-et-al.pdf>.
In: Proceedings of the Workshop Ontologies and Information Extraction at the
Summer School The Semantic Web and Language Technology - Its Potential and
Practicalities (EUROLAN'2003). Bucharest, Romania, 28 July - 8 August 2003.
The text categorisation approach described in that paper is used as the
major ingredient in our daily news analysis system NewsExplorer (freely
accessible at http://press.jrc.it/NewsExplorer) to link related news across
languages.
I hope this helps. All the best,
Ralf
Ralf Steinberger (Ralf.Steinberger at jrc.it)
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it,
http://press.jrc.it/NewsExplorer)
T.P. 267, Via Fermi 1
21020 Ispra (VA), Italy
Tel: +39 0332 78-6271
Fax: +39 0332 78-5154
Secretary: +39 0332 78-5648 or 9478
New URL: http://langtech.jrc.it. The previous
address http://www.jrc.it/langtech will only be valid for a few more months.
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Cecilie Desiree Widsteen
Sent: 18 August 2006 11:09
To: Corpora list
Subject: [Corpora-List] Resources concerning multilabel problem
Hello all!
I am looking for resources (articles, books, webpages) concerning the
multilabel (multiclass?) problem in the context of text classification.
By this I mean the fact that a document can be classified into more than
one category. I am especially interested in supervised learning algorithms
where the documents in the training set may belong to multiple classes.
Regards,
--
Cecilie Widsteen
Institute for Informatics,
University of Oslo