[Corpora-List] Analysing Reuters Corpus Using Wordsmith Version 3

Mike Scott mike at lexically.net
Mon Jun 14 08:13:50 UTC 2004


Dear All

At 09:15 11/06/2004, Siew Imm Tan wrote:
>
>I am interested in analysing the Reuters Corpus using WordSmith Tools
>Version 3. The problem is that Reuters comprises more than 800,000 XML
>files but WordSmith can only process up to 16,368 files. Has anybody ever
>attempted using WordSmith Version 3 to analyse Reuters? If so, how do you
>get around this particular limitation? Is it possible to merge the 800,000
>Reuters files into 16,000 files or so?
>


Yes, WordSmith 3 can only handle 16,000 text files or so, and yes, in
theory it might manage the job (I haven't tried) if the number of files
were reduced by concatenating one text onto another until the whole corpus
fitted into 16,000 files. WordSmith 4 has no pre-set limit on the number
of text files.

However, a really huge corpus will certainly be time-consuming to process
in either version, since some tasks are computed in memory. In versions 3
and 4, the WordList procedure stores each new word-form in memory and,
every time another token of the same word-form type is encountered,
updates the frequency information stored with that word-form. So if there
are huge numbers of different word-forms, the PC will slow down
considerably once its RAM is exhausted, at which point Windows starts to
store information in a so-called "swap-file" on the hard disk.
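
The mechanism is roughly the one in this minimal Python sketch (an
illustration only, not WordSmith's actual code; the file names and the
tokeniser are assumptions). Memory grows with the number of distinct
word-form types, not just with corpus size:

    from collections import Counter
    import re

    def build_wordlist(paths):
        freq = Counter()  # one entry per word-form type, held in RAM
        for path in paths:
            with open(path, encoding="utf-8") as f:
                # crude tokeniser: lowercase alphabetic word-forms only
                freq.update(re.findall(r"[a-z']+", f.read().lower()))
        return freq

    # wordlist = build_wordlist(["news001.txt", "news002.txt"])
    # print(wordlist.most_common(10))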

In the case of concordancing, each time a hit is found, the concordance
line and some other bits of data are stored "on the fly", as in
wordlisting. So if there are lots of hits, as with a common word-form, a
lot of RAM goes into storing these concordance lines, and eventually
the processing slows down.
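
Again as a rough sketch (the context width and tokenisation are my
assumptions, not WordSmith's): every hit appends a KWIC line to a list
held in memory, so cost is proportional to the node word's frequency:

    import re

    def concordance(text, node, width=40):
        lines = []  # grows by one KWIC line per hit, all held in RAM
        for m in re.finditer(r"\b%s\b" % re.escape(node), text,
                             re.IGNORECASE):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
        return lines

    # for line in concordance(open("sample.txt").read(), "market"):
    #     print(line)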

In practice, this means that it is easier to process huge corpora by doing
the work in chunks, e.g. making separate wordlists of different parts of
the corpus and later merging them. For concordancing, in most cases there
is a cut-off point imposed by time; users aren't prepared to wait more
than, say, one or two minutes for results. A solution is to make an index
of the corpus. That is how the CoBuild project tackled the problem: it
avoided doing work "on the fly" and used a standard index, which is
lengthy to build and hard to edit but which, once made, "knows" about each
word-form in the whole corpus. Google does something similar: when you
submit a request it doesn't search the Internet but searches only its own
index.
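
Both strategies can be sketched in a few lines of Python (an illustration
under an assumed file layout, not CoBuild's or Google's actual machinery):
per-chunk wordlists merged afterwards, and an index that is slow to build
but answers queries without rescanning the corpus:

    from collections import Counter, defaultdict
    import re

    def merge_chunk_wordlists(chunks):
        # each chunk is a list of file paths; build one wordlist per
        # chunk, then merge -- merging is cheap next to retokenising
        total = Counter()
        for chunk in chunks:
            freq = Counter()
            for path in chunk:
                with open(path, encoding="utf-8") as f:
                    freq.update(re.findall(r"[a-z']+", f.read().lower()))
            total += freq
        return total

    def build_index(tokens):
        # built once; thereafter each query is a dictionary lookup
        index = defaultdict(list)
        for pos, tok in enumerate(tokens):
            index[tok].append(pos)  # word-form -> token positions
        return index

    # index = build_index(re.findall(r"[a-z']+",
    #                                open("corpus.txt").read().lower()))
    # print(len(index["market"]))  # hit count from the index, no rescan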

As corpora get bigger, this problem gets harder to solve. WordSmith 3 was
designed to handle corpora up to the BNC (100 million words) in size, but
in on-the-fly processing it had considerable difficulty with the whole
BNC; WordSmith 4 handles the whole BNC much more easily but wasn't really
designed to tackle corpora as big as Reuters on the fly. There's a
trade-off between having a fixed corpus (which is best indexed) and a
corpus which isn't so fixed (e.g. only certain parts of the BNC, or
one's current stock of student EFL writings).

Mike Scott

Applied English Language Studies Unit
University of Liverpool
Liverpool L69 3BX, UK.

Mike.Scott at liv.ac.uk
http://www.lexically.net
http://www.liv.ac.uk/~ms2928


