[Corpora-List] Analysing Reuters Corpus Using Wordsmith Version 3

Ute Römer ute.roemer at uni-koeln.de
Fri Jun 11 15:58:19 UTC 2004


Dear Tan Siew Imm and others, 

> The problem is that Reuters comprises more than 800,000 XML files but
>Wordsmith can only process up to 16,368 files. Has anybody ever attempted
>using Wordsmith Version 3 to analyse Reuters? 

Yes, I experienced the same problem and would also be interested in a way
around it. So far, I've never really needed to use the whole corpus, so I
have only unzipped some of the archives and work with random parts of the corpus.
It would be nice to access it in a more systematic way though.
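
One more systematic way to live with the file limit might be to split the
full file list into fixed-size batches and run each batch separately. The
following is only a minimal sketch in Python, not something I have run on
the RC itself: the function names and the folder layout (XML files somewhere
below one root folder) are assumptions, and the limit of 16,368 is simply
the figure quoted above.

```python
from pathlib import Path
from typing import Iterable, Iterator, List

# Per-run file limit quoted above for WordSmith 3; adjust for other tools.
MAX_FILES = 16368


def batch_paths(paths: Iterable[str], size: int = MAX_FILES) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` paths, in sorted order."""
    ordered = sorted(paths)
    for start in range(0, len(ordered), size):
        yield ordered[start:start + size]


def list_xml_files(corpus_root: str) -> List[str]:
    """Collect every *.xml file below corpus_root (assumed folder layout)."""
    return [str(p) for p in Path(corpus_root).rglob("*.xml")]
```

With 800,000 files this gives 49 batches. Each batch could then be copied
into its own folder and loaded on its own, though the results of the
separate runs would still have to be combined afterwards.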

A possible solution to the problem might be to use different concordance
software. As far as I can see, corpus size is unlimited with MonoConc Pro
2.2, though I am not 100% sure about the number of individual files you can
load. WST version 4.0 should also work with a larger (unlimited?) number of
corpus files.

A problem with both programs might (or might not) be the overall size of
the corpus. According to my quick-and-dirty counts and extrapolations, the
RC has more than half a billion tokens -- which would slow down the more
complex searches quite a bit (at least with WST 3.0).
Btw, have you (or anyone else) done a proper word count of the corpus? (the
RC distributors told me they hadn't) -- Using MP2.2 would of course be a
solution to that problem since it does a word count whenever you load a
corpus anyway. 
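
For a count that does not depend on a concordancer, a rough token count
could also be scripted. The sketch below is only an illustration under
stated assumptions: it treats every whitespace-delimited string in the text
content of an XML file as one token (which will not match the tokenisation
of WST or MonoConc exactly), it counts all text in the file including
titles and any metadata stored as element text, and the function names and
the root folder argument are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_tokens(xml_bytes: bytes) -> int:
    """Count whitespace-delimited tokens in the text content of one XML document."""
    # Parsing from bytes lets the parser honour the file's own encoding declaration.
    root = ET.fromstring(xml_bytes)
    # itertext() yields all text between tags; markup itself is not counted.
    return sum(len(chunk.split()) for chunk in root.itertext())


def count_corpus(corpus_root: str) -> int:
    """Sum the token counts of every *.xml file below corpus_root."""
    total = 0
    for path in Path(corpus_root).rglob("*.xml"):
        total += count_tokens(path.read_bytes())
    return total
```

Calling count_corpus on the folder holding the unzipped archives would
return a single total; restricting the count to the body text would need
the relevant element names from the corpus documentation.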

Best wishes... Ute


************************************************************
 
Ute Römer
English Department
University of Hanover
Königsworther Platz 1
30167 Hannover
Germany
 
Phone: +49 (0)511 762 2997
Fax: +49 (0)511 762 2996
E-mail: ute.roemer at anglistik.uni-hannover.de
http://www.fbls.uni-hannover.de/angli/
