[Corpora-List] Looking for mid-range new corpora

Paul McNamee paul.mcnamee at jhuapl.edu
Mon Aug 29 23:13:25 UTC 2005


Chris,

I've used the CACM collection - approx 3MB - for an IR class I teach.
It's small, but there are queries and relevance judgments available,
so I can run a mini-TREC evaluation with the class and provide
students with some data to experiment with.  Because it's small,
none of the students have problems working with it.  It even fits
in an editor, and they can 'debug' their proto-IR engines by
using the editor's built-in search.
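
As a rough illustration of what such a mini-TREC evaluation involves,
here is a minimal Python sketch that scores a ranked run against
relevance judgments using mean average precision.  The function names,
the dictionary layout, and the query and document IDs are all
hypothetical; this is not the CACM file format or the trec_eval tool,
just one way a class exercise might be set up.

```python
from typing import Dict, List, Set


def average_precision(ranked_docs: List[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Sums precision at each rank where a relevant document appears,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(run: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of average precision over every query that has judgments.

    A query missing from the run is scored as 0.
    """
    scores = [average_precision(run.get(qid, []), rel)
              for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (made up for illustration): judgments and one system's rankings.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
run = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2"]}

print(round(mean_average_precision(run, qrels), 4))  # prints 0.6667
```

With a collection this small, students can check a score like this by
hand against the judgments for a query or two.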

The Reuters-21578 collection is about 22 MB, which is about the
next step up before the TREC disks.  It has labels for text
classification, but no ad hoc queries that I am aware of.

You could in principle use a subset of the TREC data; for example, some
researchers report experiments using the AP or WSJ subsets.  This would
decrease the size, but you might have an issue using the data for
classroom instruction.  I don't think the TREC data agreements permit
classroom use, but I suppose you could request permission for it.

Helpful links:
   Download site at Glasgow with several legacy IR test sets:
     http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/

   A recent web page for an IR course I taught:
     http://apl.jhu.edu/~paulmac/ir.html

I don't have any suggestions for newer collections.  You could use
publicly available corpora (e.g., texts from Project Gutenberg or
Wikipedia), but you'd have to come up with your own queries and
relevance assessments.

Best regards,

- Paul

Paul McNamee
Research and Technology Development Center
Johns Hopkins University Applied Physics Lab
11100 Johns Hopkins Road
Laurel MD  20723-6099   USA
Voice: +1 443 778 3816
Fax:   +1 443 778 6904
Email: paul.mcnamee at jhuapl.edu



On Mon, 29 Aug 2005, Chris Jordan wrote:

> Hey all,
>
> I am looking for a mid-range corpus that is relatively new for a higher
> level undergrad course in Information Retrieval. I don't want to use the TREC
> sets as they are giant, though I don't want something that is insignificant
> either. Having qrels and some publications on it is a bonus too. Thanks,
>
> -- 
> Chris Jordan
> Dalhousie Computer Science PhD Candidate
> Dalhousie Student Union Graduate Senate Representative


