Corpora: Using SARA to query other corpora than the BNC (fwd)

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Sat Jun 23 16:21:13 UTC 2001


When we talk about "efficiency" we usually refer to the performance of
some activity. A format which is efficient for one purpose/activity (such
as fast retrieval of context) is generally not efficient for another (such
as inter-platform communication). This is hardly a new idea!

I share your high opinion of the Corpus Workbench, by the way; it is indeed an
excellent piece of software. I don't think it is much better than SARA
with respect to disk space usage however -- both systems are able to give
good performance (for retrieval purposes) because both systems make
optimised external index files. You have to add those into the equation if
you are talking about efficiency, surely. And the last time I looked at
it, CWB was less able to make use of the SGML markup in a corpus than SARA
is. (But as a compensating strength, it includes an efficient indexing
algorithm for POS marks, which SARA didn't.) Another major difference is
that the SARA system retains the original text files as well as the index,
whereas I believe CWB discards the text. This certainly reduces the
overall system size, but the price is that some information in the source
text has to be lost.
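
To make that trade-off concrete, here is a minimal sketch in Python (purely
illustrative -- the file formats and function names are invented, and this is
not how SARA or CWB actually store their data): an external index file is
built beside the retained source text, and a query uses the index to locate
hits but re-reads the original text to display context. Both files have to be
counted when comparing disk usage.

    import pickle
    from collections import defaultdict

    def build_index(corpus_path, index_path):
        """Map each word form to the (line, column) positions where it occurs."""
        index = defaultdict(list)
        with open(corpus_path, encoding="utf-8") as corpus:
            for line_no, line in enumerate(corpus):
                for col, token in enumerate(line.split()):
                    index[token.lower()].append((line_no, col))
        # The index is a separate file on disk, so it counts towards the
        # overall footprint just as much as the retained text does.
        with open(index_path, "wb") as out:
            pickle.dump(dict(index), out)

    def concordance(corpus_path, index_path, word, width=4):
        """Locate hits via the index, then read the retained text for context."""
        with open(index_path, "rb") as f:
            index = pickle.load(f)
        with open(corpus_path, encoding="utf-8") as corpus:
            lines = corpus.read().splitlines()
        for line_no, col in index.get(word.lower(), []):
            tokens = lines[line_no].split()
            left = " ".join(tokens[max(0, col - width):col])
            right = " ".join(tokens[col + 1:col + 1 + width])
            print(left, "[" + tokens[col] + "]", right)

    # e.g. build_index("corpus.txt", "corpus.idx")
    #      concordance("corpus.txt", "corpus.idx", "markup")
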


The BNC sampler disk we made a few years ago was intended to provoke some
informed discussion of the relative strengths of a variety of what were
regarded then as state-of-the-art corpus access systems (it included
Wordsmith, SARA, CWB, and Qwick) when handling SGML marked-up corpus data.
If such discussion has happened, I seem to have missed it. Ah well.

Lou


On Fri, 22 Jun 2001, Stefan Evert wrote:

>
>    > Meta languages are ideal for interchange purposes but I doubt
>    > that ANY software will handle SGML data describing 100 million
>    > annotated word forms efficiently. But that's another story.
>
>    On what grounds do you make this assertion? I suppose it all
>    depends what you mean by "handle efficiently", but it's simply not
>    true that NO software can handle SGML data on that scale.
>
> Perhaps he should have written "raw SGML data", in which case I will
> absolutely second that opinion. All XML encodings that I have seen so
> far waste more space (in terms of characters) on markup than on the
> actual data. An XML-encoded version of a 100 million word corpus (with
> PoS and lemma annotations) will usually take up several gigabytes of
> disk space.
>
> Of course, the corpus size can be drastically reduced with standard
> compression algorithms (gzip or bzip2), but the compressed corpus
> cannot be accessed efficiently.
>
>    And what
>    would you advocate as an alternative?
>
> Hope you don't mind the plug: the IMS Corpus Workbench was designed
> for corpora of that size and offers both (relatively) compact storage
> and (relatively) efficient access (it isn't available for HP-UX either,
> though).
>
> Regards,
> Stefan.
>
> --
> ``I could probably subsist for a decade or more on the food energy
>   that I have thriftily wrapped around various parts of my body.''
>                                                 -- Jeffrey Steingarten
> ______________________________________________________________________
> C.E.R.T. Marbach                         (CQP Emergency Response Team)
> http://www.ims.uni-stuttgart.de/~evert                  schtepf at gmx.de
>
>


