Call: 2nd Challenge on Large Scale Hierarchical Text Classification

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Tue Jan 18 21:11:57 UTC 2011

Date: Mon, 17 Jan 2011 09:00:00 +0100
From: Eric Gaussier <eric.gaussier at>
Message-ID: <4D33F700.5030402 at>

We apologize if you receive this message, about the 2nd challenge on
very large scale text classification, more than once.

                      Second Pascal Challenge on
             Large Scale Hierarchical Text Classification

               Web site:
                  Email:lshtc_info at

Following a successful first edition, we are pleased to announce the
2nd edition of the Large Scale Hierarchical Text Classification
(LSHTC) Pascal Challenge. The LSHTC Challenge is a hierarchical text
classification competition, using large datasets. This year's
challenge will increase the scale and the difficulty of the task,
using data from Wikipedia, in addition to the ODP
Web directory data.

Hierarchies are becoming ever more popular for the organization of
text documents, particularly on the Web. Web directories and Wikipedia
are two examples of such hierarchies. Along with their widespread use
comes the need for automated classification of new documents into the
categories in the hierarchy. As the size of the hierarchy grows and
the number of documents to be classified increases, a number of
interesting machine learning problems arise. In particular, it is one
of the rare situations where data sparsity remains an issue, despite
the vastness of available data: as more documents become available,
more classes are also added to the hierarchy, and there is a very high
imbalance between the classes at different levels of the
hierarchy. Additionally, the statistical dependence of the classes
poses challenges and opportunities for the learning methods.

The challenge consists of three categorization tasks, involving
different documents and category systems. In particular, the largest
category system, based on Wikipedia, contains more than 300,000
categories and 2M documents for training. The largest category system
ever used in the past for evaluation purposes, to the best of our
knowledge, was based on the Yahoo! Directory and contained 130,000
categories and 500,000 training documents.  In addition to the largest
task, two smaller ones, based on Wikipedia and DMOZ respectively, are
included in the challenge. Their scale is on the order of that of the
first edition of the challenge. All of the datasets in this edition
are multi-label. In the two datasets that are based on Wikipedia,
each document is assigned on average to 3.2 and 4.6 categories,
respectively. Furthermore, the hierarchies are no longer
simple tree structures, as both documents and subcategories are
allowed to belong to more than one other category. More information
regarding the tasks and the challenge rules can be found at the
challenge's Web site; follow the "Tasks, Rules and Guidelines" link.
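
As a rough illustration of the data structure described above (a
multi-label collection over a hierarchy that is a directed acyclic
graph rather than a tree), here is a minimal Python sketch. The
category names, document ids, and function names are all hypothetical
toy values chosen for illustration; they do not reflect the actual
challenge data or its file format.

```python
# Hypothetical toy hierarchy: category -> list of parent categories.
# "ml" has two parents, so the hierarchy is a DAG, not a tree.
PARENTS = {
    "science": [],
    "computing": [],
    "statistics": ["science"],
    "ai": ["computing"],
    "ml": ["ai", "statistics"],
}

# Hypothetical multi-label assignments: document id -> categories.
DOC_LABELS = {
    "d1": ["ml"],
    "d2": ["ai", "statistics"],
    "d3": ["ml", "computing", "science"],
}


def ancestors(category, parents):
    """Return every category reachable by following parent links."""
    seen = set()
    stack = list(parents[category])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def average_labels(doc_labels):
    """Average number of categories assigned per document."""
    total = sum(len(labels) for labels in doc_labels.values())
    return total / len(doc_labels)


def multi_parent_categories(parents):
    """Categories with more than one parent (impossible in a tree)."""
    return sorted(c for c, ps in parents.items() if len(ps) > 1)


if __name__ == "__main__":
    print(ancestors("ml", PARENTS))
    print(average_labels(DOC_LABELS))
    print(multi_parent_categories(PARENTS))
```

In this toy setting, a classifier that predicts "ml" for a document
implicitly places it under both the "computing" and the "science"
branches, which is the kind of class dependence the announcement
refers to.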

As in the first edition, participants will be able to submit runs
continuously in order to improve their systems. This year we also
plan a two-stage evaluation of the participating methods: one stage
measuring classification performance and one measuring computational
performance. It is important to measure both, as the two are
interdependent. The results will be included in a final report about
the challenge, and we also aim to organize a special ECML'11 workshop.

In order to register for the challenge and gain access to the datasets
you must have an account at the challenge Web site.

Key dates:

Start of testing: January 15, 2011
End of testing: March 31, 2011
Submission of executables and short papers to challenge organizers:
April 30, 2011
Submission of workshop papers: May 31, 2011
ECML'11 workshop (subject to approval): September 5, 2011


George Paliouras, NCSR "Demokritos", Athens, Greece
Eric Gaussier, LIG, Grenoble, France
Aris Kosmopoulos, NCSR "Demokritos" & AUEB, Athens, Greece
Ion Androutsopoulos, AUEB, Athens, Greece
Thierry Artières, LIP6, Paris, France
Patrick Gallinari, LIP6, Paris, France

Message distributed via the Langage Naturel mailing list <LN at>

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)