[Corpora-List] Analysing Reuters Corpus Using Wordsmith Version 3

Tony Rose tr at acl.icnet.uk
Fri Jun 11 16:22:42 UTC 2004


> A problem with both programs might (or might not) be the overall
> size of the corpus. According to my rough-and-dirty counts and
> extrapolations, the RC has more than half a billion tokens -- which would
> slow down the more complex searches quite a bit (at least with WST 3.0).
> Btw, have you (or anyone else) done a proper word count of the corpus?
> (The RC distributors told me they hadn't.) Using MP2.2 would of course be
> a solution to that problem, since it does a word count whenever you load
> a corpus anyway.

FYI you can find lots more statistics on the corpus at:

http://about.reuters.com/researchandstandards/corpus/statistics/index.asp

and many pre-processed versions of the raw data are linked from Dave Lewis's
web page, e.g.

http://www.daviddlewis.com/resources/testcollections/rcv1/
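
If you just want a rough token count of your own, something along these
lines might do as a starting point. It is only a sketch: it assumes the
text has already been extracted to plain-text files under one directory,
and it treats any whitespace-delimited string as a token, which is a crude
definition. The names count_tokens and iter_corpus_lines and the "*.txt"
pattern are made up for illustration.

```python
import re
import sys
from collections import Counter
from pathlib import Path

# Assumption: a "token" is any run of non-whitespace characters.
TOKEN_RE = re.compile(r"\S+")


def count_tokens(lines):
    """Return (total_tokens, type_counts) for an iterable of text lines."""
    types = Counter()
    total = 0
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())  # case-folded tokens
        total += len(tokens)
        types.update(tokens)
    return total, types


def iter_corpus_lines(root, pattern="*.txt"):
    """Yield lines from every matching file under root, one file at a time."""
    # Assumption: the corpus was extracted to plain-text files beforehand.
    for path in sorted(Path(root).rglob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as fh:
            yield from fh


if __name__ == "__main__" and len(sys.argv) > 1:
    total, types = count_tokens(iter_corpus_lines(sys.argv[1]))
    print(f"{total} tokens, {len(types)} types")
```

Because it streams line by line, only the type counts are held in memory,
so the overall corpus size should matter mainly for running time. A
different tokenisation will of course give a different count.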

Cheers,
Tony


