[Corpora-List] Call for participation: Second Pascal Challenge on Large Scale Hierarchical Text classification

X.L. Wang arthurxlwang at gmail.com
Fri Jan 21 06:21:40 UTC 2011


http://lshtc.iit.demokritos.gr/LSHTC2_CFP

Sorry for cross-posting.

Key dates

* Start of testing: January 15, 2011
* End of testing: March 31, 2011
* Submission of executables and short papers to challenge organizers:
April 30, 2011
* Submission of workshop papers: May 31, 2011
* ECML'11 workshop (subject to approval): September 5, 2011


Following a successful first edition, we are pleased to announce the 2nd
edition of the Large Scale Hierarchical Text Classification (LSHTC)
Pascal Challenge. The LSHTC Challenge is a hierarchical text
classification competition, using large datasets. This year’s challenge
will increase the scale and the difficulty of the task, using data from
Wikipedia (www.wikipedia.org), in addition to the ODP Web directory data
(www.dmoz.org).

Hierarchies are becoming ever more popular for the organization of text
documents, particularly on the Web. Web directories and Wikipedia are
two examples of such hierarchies. Along with their widespread use comes
the need for automated classification of new documents into the categories
in the hierarchy. As the size of the hierarchy grows and the number of
documents to be classified increases, a number of interesting machine
learning problems arise. In particular, it is one of the rare situations
where data sparsity remains an issue, despite the vastness of available
data: as more documents become available, more classes are also added to
the hierarchy, and there is a very high imbalance between the classes at
different levels of the hierarchy. Additionally, the statistical
dependence of the classes poses challenges and opportunities for the
learning methods.

The challenge consists of three categorization tasks, involving
different documents and category systems. In particular, the largest
category system, based on Wikipedia, contains more than 300,000
categories and 2M documents for training. The largest category system
ever used in the past for evaluation purposes, to the best of our
knowledge, was based on the Yahoo! Directory and contained 130,000
categories and 500,000 training documents. In addition to the largest
task, two smaller ones, based on Wikipedia and DMOZ respectively, are
included in the challenge. Their scale is on the order of the tasks in
the first edition of the challenge. All of the datasets in this edition
are multi-label. In particular, in the two datasets that are based on
Wikipedia, each document is assigned on average to 3.2 and 4.6
categories respectively. Furthermore, the hierarchies are no longer
simple tree structures, as both documents and subcategories are allowed
to belong to more than one other category. More information regarding
the tasks and the challenge rules can be found at the challenge's Web
site; follow the "Tasks, Rules and Guidelines" link.
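
For readers new to this setting, the following is a minimal Python
sketch of what "multi-label" and "no longer simple tree structures"
mean in practice. The category names, document ids and helper functions
are made up for illustration and are not taken from the challenge data
or its file format.

```python
# Hypothetical toy hierarchy: category -> list of parent categories.
# A category may have more than one parent, so this is a DAG, not a tree.
parents = {
    "Machine learning": ["Computer science", "Statistics"],
    "Text classification": ["Machine learning", "Information retrieval"],
    "Information retrieval": ["Computer science"],
    "Computer science": [],
    "Statistics": [],
}

# Hypothetical multi-label documents: document id -> assigned categories.
doc_labels = {
    "doc1": ["Text classification"],
    "doc2": ["Statistics", "Information retrieval"],
    "doc3": ["Machine learning", "Information retrieval", "Statistics"],
}


def ancestors(category, parents):
    """Return every category reachable by following parent links."""
    seen = set()
    stack = list(parents.get(category, []))
    while stack:
        node = stack.pop()
        if node not in seen:  # a node can be reached via several paths
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def average_labels_per_document(doc_labels):
    """Mean number of categories assigned to a document."""
    return sum(len(labels) for labels in doc_labels.values()) / len(doc_labels)


if __name__ == "__main__":
    print(sorted(ancestors("Text classification", parents)))
    print(average_labels_per_document(doc_labels))
```

In a tree, each category would have exactly one chain of ancestors; here
"Text classification" reaches "Computer science" along two different
paths, which is why the traversal keeps a set of visited categories.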

As in the first edition, participants will be able to submit runs
continuously in order to improve their systems. This year we also plan
a two-stage evaluation of the participating methods: one stage
measuring classification performance and one measuring computational
performance. It is important to measure both, as the two are
interdependent. The results will be included in a final report about
the challenge, and we also aim to organize a special ECML'11 workshop.

In order to register for the challenge and gain access to the datasets
you must have an account at the challenge Web site.

Organisers:


George Paliouras, NCSR "Demokritos", Athens, Greece
Eric Gaussier, LIG, Grenoble, France
Aris Kosmopoulos, NCSR "Demokritos" & AUEB, Athens, Greece
Ion Androutsopoulos, AUEB, Athens, Greece
Thierry Artières, LIP6, Paris, France
Patrick Gallinari, LIP6, Paris, France

