Corpora: Using SARA to query other corpora than the BNC

Thomas Kuenneth tommi at linguistik.uni-erlangen.de
Mon Jun 25 08:53:04 UTC 2001


First of all, please accept my apologies for having taken such a long time to
respond. During the weekend I was quite busy preparing several things for
COMPLEX2001.

I'd then like to respond to Lou Burnard:

> Good to see that we have some agreement on that at any rate.

Well, it is absolutely necessary to have corpus data encoded in a
well-documented format that can be read and interpreted by any system that is
interested in the data. And meta languages are undoubtedly an ideal basis for
this interchange purpose.

Nonetheless, I think that there are better ways to represent corpus data
internally, inside the system. This is what I was referring to when I said:

> > will handle SGML data describing 100 million annotated word forms
> > efficiently.

Your (surprised, I guess) response was:

> On what grounds do you make this assertion? I suppose it all depends what
> you mean by "handle efficiently", but it's simply not true that NO
> software can handle SGML data on that scale

I was quite happy to see that Stefan Evert guessed exactly what I meant. :-)

As you probably recall, he said:

> Perhaps he should have written "raw SGML data", in which case I will
> absolutely second that opinion. All XML encodings that I have seen so
> far waste more space (in terms of characters) on markup than on the
> actual data. An XML-encoded version of a 100 million word corpus (with
> PoS and lemma annotations) will usually take up several gigabytes of
> disk space.

That - in a nutshell - is what I should have said. :-)
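
The point about markup overhead can be illustrated with a minimal sketch. The
sample sentence, the element names and the attribute names below are
hypothetical and not taken from any particular corpus encoding; the sketch also
counts the PoS and lemma annotations as markup, since they sit inside the tags.

```python
import re

# Hypothetical one-sentence sample; element and attribute names are
# illustrative assumptions, not a real corpus encoding.
SAMPLE = (
    '<s n="1">'
    '<w pos="AT0" lemma="the">The</w>'
    '<w pos="NN1" lemma="cat">cat</w>'
    '<w pos="VVZ" lemma="sleep">sleeps</w>'
    '</s>'
)


def markup_share(xml_text):
    """Return (markup_chars, text_chars) for a string of XML.

    Everything between '<' and '>' (inclusive) is counted as markup,
    so attribute values such as PoS tags and lemmas count as markup too.
    """
    markup = sum(len(tag) for tag in re.findall(r"<[^>]*>", xml_text))
    return markup, len(xml_text) - markup


markup_chars, text_chars = markup_share(SAMPLE)
print("markup characters:", markup_chars)
print("text characters:  ", text_chars)
```

On this toy sample the tags take up far more characters than the three word
forms themselves; how large the ratio is for a real corpus depends on the
encoding scheme used.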

> And what would you advocate as an alternative?

Well, my basic assumption is that implementing one's own database technology
for storing corpus data brings a lot of problems that can be avoided if
off-the-shelf systems are used instead. I claim that an RDBMS can in fact be an
ideal basis for storing and retrieving corpus data. Since SQL is not really a
user-friendly language (at least for linguists :-)), client programs provide the
user interface and communicate with the RDBMS on the user's behalf. That, by the
way, is what I am going to talk about at Bham this week.
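
As a minimal sketch of this idea (not the system described in this message),
annotated word forms can be stored one per row in a relational table, with a
small client function that turns a simple request into SQL. SQLite stands in
for the RDBMS here, and the table layout, the column names and the helper names
build_corpus and find are all hypothetical.

```python
import sqlite3


def build_corpus(conn, tokens):
    """Create a one-row-per-token table and load (word, pos, lemma) tuples."""
    conn.execute(
        "CREATE TABLE token ("
        " position INTEGER PRIMARY KEY,"
        " word TEXT NOT NULL,"
        " pos TEXT NOT NULL,"
        " lemma TEXT NOT NULL)"
    )
    # An index on the lemma column lets the database answer lemma
    # lookups without scanning every token.
    conn.execute("CREATE INDEX token_lemma ON token (lemma)")
    conn.executemany(
        "INSERT INTO token (position, word, pos, lemma) VALUES (?, ?, ?, ?)",
        [(i, w, p, l) for i, (w, p, l) in enumerate(tokens)],
    )


def find(conn, lemma=None, pos=None):
    """Client-side helper: build the SQL so the user does not have to."""
    clauses, params = [], []
    if lemma is not None:
        clauses.append("lemma = ?")
        params.append(lemma)
    if pos is not None:
        clauses.append("pos = ?")
        params.append(pos)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = "SELECT position, word FROM token" + where + " ORDER BY position"
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
build_corpus(conn, [
    ("The", "AT0", "the"),
    ("cats", "NN2", "cat"),
    ("sleep", "VVB", "sleep"),
    ("and", "CJC", "and"),
    ("the", "AT0", "the"),
    ("cat", "NN1", "cat"),
    ("sleeps", "VVZ", "sleep"),
])

# All word forms of the lemma "cat", with their corpus positions.
print(find(conn, lemma="cat"))
```

The user only supplies a lemma and/or a PoS tag; the SQL stays hidden inside
the client, and storage, indexing and retrieval are left to the database.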

Regards
Thomas
---
Thomas Kuenneth M.A.           Universitaet Erlangen-Nuernberg
Institut fuer Germanistik         Abteilung Computerlinguistik
Bismarckstr. 6  *  D-91054 Erlangen  *  Tel.: +49 9131 8529250
http://www.linguistik.uni-erlangen.de/~tommi


