[Corpora-List] XML Wikipedia Collections for IR/ML Research
Ludovic DENOYER
ludovic.denoyer at lip6.fr
Fri Apr 7 17:39:11 UTC 2006
Wikipedia XML Corpus for research
Ludovic DENOYER
LIP6 - University of Paris 6
http://www-connex.lip6.fr/~denoyer/wikipediaXML
Technical report (currently Draft):
http://www-connex.lip6.fr/~denoyer/homepage/publications/TECHREP2006.pdf
=============
This message announces the release of a set of large XML document
collections.
These collections may be of interest to the Information Retrieval
and Machine Learning communities.
These collections have been developed as a joint project between the
DELOS and PASCAL Networks of Excellence.
===========
We offer a large set of XML collections based on Wikipedia. These
collections can be used in a wide variety of XML IR/Machine Learning
tasks such as ad-hoc retrieval, categorization, clustering, or structure
mapping. These corpora are used, for example, in the INEX 2006
competition (http://inex.is.informatik.uni-duisburg.de/2006) and in the
XML Document Mining Challenge (http://xmlmining.lip6.fr).
Brief description of the collections:
- 8 different languages: English, German, French, Dutch, Spanish,
Chinese, Arabic, Japanese
- 660,000 documents for the English collection
- All documents are organized in a hierarchy of categories
- Some collections have been built for the comparison of
categorization/clustering algorithms
- Multimedia Collection (more than 300,000 pictures)
- Entity Collection
Other collections (Cross-Language, NLP Collection) will be provided soon.
More information on the web site:
http://www-connex.lip6.fr/~denoyer/wikipediaXML
Best regards,
Ludovic DENOYER
Assistant Professor
http://www-connex.lip6.fr/~denoyer