Corpora: Using SARA to query other corpora than the BNC

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Fri Aug 3 19:02:26 UTC 2001


So here comes an extremely late reply, written just as I am about to
go on holiday ...

   The BNC sampler disk we made a few years ago was intended to
   provoke some informed discussion of the relative strengths of a
   variety of what were regarded then as state of the art corpus
   access systems (it included Wordsmith, SARA, CWB, and Qwick) when
   handling SGML marked up corpus data. If such discussion has
   happened, I seem to have missed it. Ah well.

I was going to say that now, perhaps, we have an opportunity to start
such a discussion; but having myself taken more than a month to write
an answer, my hopes for a lively discussion aren't that high any more.

   I share your high opinion of Corpus Work Bench by the way; it is
   indeed an excellent piece of software.

Thanks for the praise. Of course, as with every piece of software, the
next version is going to be much better. Which brings me back to the
shameless plug I put into my last e-mail:

   ``Hope you don't mind the plug: the IMS Corpus Workbench was
   designed for corpora of that size and offers both (relatively)
   compact storage and (relatively) efficient access (it isn't
   available for HP-UX either, though).''

inviting Adam Kilgarriff's riposte

   ah, but that invites the repost "when?!?!" (for a new interface)

It seems that the only way of getting the new release out at last is
to commit myself publicly to a deadline. So here goes: the new version
of the IMS Corpus Workbench will be released around the end of
September (2001 -- I shouldn't leave myself any loopholes :o). The
version number is going to be 3.0, as we skipped version 2.3, which we
had meant to release about two years ago.

I hope to get many of you interested in the new release (precompiled
binaries for SUN Solaris and x86-Linux only), and thus make another
attempt to stir up a discussion about corpus access software.

   I don't think it is much better than SARA with respect to disk
   space usage however --

I haven't had a close look at how much disk space SARA uses, but I can
give you some figures for the CWB, which at least allow a comparison
with plain XML files. For a (German) 40 million token corpus without
annotations or XML-style markup, the CWB binary format requires about
150 MB of disk space (using compression), including the index files.
The same text in plain ASCII (ISO-8859-1, to be precise) encoding
takes up more than 240 MB, and an XML format would increase the size
even further. Even when the ASCII text is compressed with gzip, it
still takes up 97 MB -- and that doesn't give you an index.
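
To put those figures on a per-token footing, here is a rough
back-of-the-envelope calculation. It is only a sketch: it assumes
1 MB = 10^6 bytes and treats "about 150 MB" and "more than 240 MB" as
exact values, and the function name is made up for illustration.

```python
# Rough per-token storage cost for the 40 million token corpus.
# Assumption: 1 MB = 1,000,000 bytes; the sizes are the approximate
# figures quoted in the text, not fresh measurements.

TOKENS_MILLIONS = 40


def bytes_per_token(size_mb, tokens_millions=TOKENS_MILLIONS):
    """Average number of bytes used per token for a given total size."""
    return (size_mb * 1_000_000) / (tokens_millions * 1_000_000)


sizes_mb = {
    "CWB binary format (compressed, with index)": 150,
    "plain ISO-8859-1 text": 240,
    "gzip-compressed text (no index)": 97,
}

for label, size in sizes_mb.items():
    ratio = size / sizes_mb["plain ISO-8859-1 text"]
    print(f"{label}: {bytes_per_token(size):.2f} bytes/token, "
          f"{ratio:.0%} of plain text")
```

Under these assumptions the indexed CWB format costs a little over one
byte per token more than the gzip-compressed text, which has no index
at all.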

   both systems are able to give good performance (for retrieval
   purposes) because both systems make optimised external index files.
   You have to add those into the equation if you are talking about
   efficiency, surely.

When I talk about an XML data format, I usually assume that there are
no external index files; but that may be an attitude that many of you
do not share.

   And the last time I looked at it, CWB was less able to make use of
   the SGML markup in a corpus than SARA is.

The new version will be much better in that respect. However, to be
fair one has to admit that it still requires a certain amount (and the
right kind) of preprocessing to make the information from the SGML
markup readily available in corpus queries.


Kind regards,
Stefan.

--
``I could probably subsist for a decade or more on the food energy
  that I have thriftily wrapped around various parts of my body.''
                                                -- Jeffrey Steingarten
______________________________________________________________________
C.E.R.T. Marbach                         (CQP Emergency Response Team)
http://www.ims.uni-stuttgart.de/~evert                  schtepf at gmx.de


