[Corpora-List] XML Wikipedia Collections for IR/ML Research

Fri Apr 7 17:39:11 UTC 2006

Wikipedia XML Corpus for research

Ludovic DENOYER

LIP6 - University of Paris 6

http://www-connex.lip6.fr/~denoyer/wikipediaXML 

Technical report (currently a draft):
http://www-connex.lip6.fr/~denoyer/homepage/publications/TECHREP2006.pdf 

=============

This message announces the release of a set of large XML document
collections.
These collections may be of interest to the Information Retrieval and
Machine Learning communities.
These collections have been developed as a joint project between the
DELOS and PASCAL Networks of Excellence.

===========

We propose a large set of XML collections based on Wikipedia. These
collections can be used in a wide variety of XML IR/Machine Learning
tasks such as ad-hoc retrieval, categorization, clustering, or structure
mapping. These corpora are used, for example, in the INEX 2006
competition (http://inex.is.informatik.uni-duisburg.de/2006) and in the
XML Document Mining Challenge (http://xmlmining.lip6.fr).

Brief description of the collections:

- 8 different languages: English, German, French, Dutch, Spanish,
Chinese, Arabic, Japanese

- 660,000 documents for the English collection

- All documents are organized in a hierarchy of categories

- Some collections have been built for the comparison of
categorization/clustering algorithms

- Multimedia Collection (more than 300,000 pictures)

- Entity Collection

Other collections (Cross-Language, NLP Collection) will be provided soon.

More information is available on the web site:
http://www-connex.lip6.fr/~denoyer/wikipediaXML

Best regards,

Ludovic DENOYER

Assistant Professor

http://www-connex.lip6.fr/~denoyer