Corpora: Using SARA to query other corpora than the BNC

Fri Jun 22 17:46:47 UTC 2001

   > Meta languages are ideal for interchange purposes but I doubt
   > that ANY software will handle SGML data describing 100 million
   > annotated word forms efficiently. But that's another story.

   On what grounds do you make this assertion? I suppose it all
   depends what you mean by "handle efficiently", but it's simply not
   true that NO software can handle SGML data on that scale.

Perhaps he should have written "raw SGML data", in which case I will
absolutely second that opinion. All XML encodings that I have seen so
far waste more space (in terms of characters) on markup than on the
actual data. An XML-encoded version of a 100 million word corpus (with
PoS and lemma annotations) will usually take up several gigabytes of
disk space.
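
To make the markup-versus-data ratio concrete, here is a minimal sketch that counts both for one tiny sentence. The encoding is invented for illustration: the element name `w` and the attributes `pos` and `lemma` are assumptions, not the format of the BNC or of any particular corpus. Word forms and annotation values are counted as data; everything else is counted as markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: element and attribute names are illustrative only.
SAMPLE = (
    '<s>'
    '<w pos="DT" lemma="the">The</w>'
    '<w pos="NN" lemma="corpus">corpus</w>'
    '<w pos="VBZ" lemma="be">is</w>'
    '<w pos="JJ" lemma="large">large</w>'
    '</s>'
)

def markup_overhead(xml_text):
    """Return (data_chars, markup_chars) for an XML string.

    Data = text content plus attribute values (word forms and annotations).
    Markup = every other character (tags, attribute names, quotes, brackets).
    Assumes the text contains no entity references, so that parsed lengths
    match the lengths in the serialized string.
    """
    root = ET.fromstring(xml_text)
    text_chars = sum(len(t) for t in root.itertext())
    attr_chars = sum(len(v) for el in root.iter() for v in el.attrib.values())
    data_chars = text_chars + attr_chars
    return data_chars, len(xml_text) - data_chars

data, markup = markup_overhead(SAMPLE)
print("data characters:  ", data)
print("markup characters:", markup)
```

Even with the annotation values counted as data, the markup comes out larger than the data for this sample. Real encodings will differ in the details, so treat the numbers as an illustration of the effect rather than a measurement.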

Of course, the corpus size can be drastically reduced with standard
compression algorithms (gzip or bzip2), but the compressed corpus
cannot be accessed efficiently.
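
A small sketch of why access to a compressed corpus is slow, using Python's standard `gzip` module. The token data and the helper `read_at` are made up for the example. A gzip stream has no index, so `seek()` on a `GzipFile` is emulated: to reach a position near the end, everything before it has to be decompressed and thrown away.

```python
import gzip
import io

def read_at(compressed, offset, length):
    """Read `length` bytes at uncompressed `offset` from gzip-compressed bytes.

    Hypothetical helper for illustration. The seek is emulated by the gzip
    module: all data before `offset` is decompressed and discarded first.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
        f.seek(offset)
        return f.read(length)

# Made-up "corpus": 100,000 fixed-width records of 13 bytes each.
RECORD_LEN = 13
tokens = b"".join(b"token%07d\n" % i for i in range(100000))
compressed = gzip.compress(tokens)

# Fetching the last record costs a full pass over the compressed stream.
last = read_at(compressed, len(tokens) - RECORD_LEN, RECORD_LEN)
print(len(tokens), len(compressed), last)
```

The cost of a single lookup therefore grows with its position in the file, which is what makes a plain gzip or bzip2 file unsuitable for the random access that corpus queries need.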

   And what
   would you advocate as an alternative?

Hope you don't mind the plug: the IMS Corpus Workbench was designed
for corpora of that size and offers both (relatively) compact storage
and (relatively) efficient access (it isn't available for HP-UX either,
though).

Regards,
Stefan.

--
``I could probably subsist for a decade or more on the food energy
  that I have thriftily wrapped around various parts of my body.''
                                                -- Jeffrey Steingarten
______________________________________________________________________
C.E.R.T. Marbach                         (CQP Emergency Response Team)
http://www.ims.uni-stuttgart.de/~evert                  schtepf at gmx.de