Corpora: Using SARA to query other corpora than the BNC
Thomas Kuenneth
tommi at linguistik.uni-erlangen.de
Mon Jun 25 08:53:04 UTC 2001
First of all, please accept my apologies for having taken such a long time to
respond. During the weekend I was quite busy preparing several things for
COMPLEX2001.
I'd then like to respond to Lou Burnard:
> Good to see that we have some agreement on that at any rate.
Well, it is absolutely necessary to have corpus data encoded in a
well-documented format that can be read and interpreted by any system that needs
the data. And meta languages are undoubtedly an ideal basis for this
interchange purpose.
Nonetheless, I think that there are better ways to represent corpus data
internally, inside the system. This is what I was referring to when I said:
> > will handle SGML data describing 100 million annotated word forms
> > efficiently.
Your - well, I guess - surprised response:
> On what grounds do you make this assertion? I suppose it all depends what
> you mean by "handle efficiently", but it's simply not true that NO
> software can handle SGML data on that scale
I was quite happy to see that Stefan Evert guessed exactly what I meant. :-)
As you probably recall, he said:
> Perhaps he should have written "raw SGML data", in which case I will
> absolutely second that opinion. All XML encodings that I have seen so
> far waste more space (in terms of characters) on markup than on the
> actual data. An XML-encoded version of a 100 million word corpus (with
> PoS and lemma annotations) will usually take up several gigabytes of
> disk space.
That - in a nutshell - is what I should have said. :-)
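The size argument can be illustrated with a rough sketch. It is only an
assumption about what a typical encoding looks like: the element name `w`, the
attribute names `pos` and `lemma`, and the helper names are hypothetical, and
the annotations are counted as markup rather than as "actual data". Attribute
values are not XML-escaped, so this only works for plain alphanumeric input.

```python
def encode_token(word, pos, lemma):
    """Encode one annotated word form as a single XML element (hypothetical format)."""
    return f'<w pos="{pos}" lemma="{lemma}">{word}</w>'


def markup_overhead(tokens):
    """Return (markup_chars, data_chars) for a list of (word, pos, lemma) tuples.

    data_chars counts only the word forms themselves; everything else in the
    encoded string, including the annotations, is counted as markup.
    """
    data_chars = sum(len(word) for word, _, _ in tokens)
    total_chars = sum(len(encode_token(w, p, l)) for w, p, l in tokens)
    return total_chars - data_chars, data_chars
```

Under these assumptions a single token such as ("corpora", "NN2", "corpus")
already spends several times more characters on markup than on the word form.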
> And what would you advocate as an alternative?
Well, my basic assumption is that implementing one's own database technology for
storing corpus data entails a lot of problems that can be avoided if
off-the-shelf systems are used instead. I claim that an RDBMS can in fact be an
ideal basis for storing and retrieving corpus data. Since SQL is not really a
user-friendly language (at least for linguists :-)), client programs provide the
user interface and communicate with the RDBMS on the user's behalf. That, by the
way, is what I am going to talk about at Bham this week.
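To give a rough idea of what storing annotated word forms in an RDBMS could look
like, here is a minimal sketch using Python's built-in sqlite3 module. It is not
the system I will present: the table name `token`, its columns, the sample data
and the helper functions are all assumptions made for illustration only.

```python
import sqlite3

# Hypothetical sample data: (word form, PoS tag, lemma).
SAMPLE = [
    ("The", "AT0", "the"),
    ("corpora", "NN2", "corpus"),
    ("were", "VBD", "be"),
    ("stored", "VVN", "store"),
    ("in", "PRP", "in"),
    ("a", "AT0", "a"),
    ("corpus", "NN1", "corpus"),
    ("database", "NN1", "database"),
]


def build_corpus(tokens):
    """Load (word, pos, lemma) tuples into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE token (
               position INTEGER PRIMARY KEY,  -- running token number
               word     TEXT NOT NULL,
               pos      TEXT NOT NULL,
               lemma    TEXT NOT NULL
           )"""
    )
    # An index on the lemma column lets the RDBMS answer lemma queries
    # without scanning every token.
    conn.execute("CREATE INDEX token_lemma ON token (lemma)")
    conn.executemany(
        "INSERT INTO token (position, word, pos, lemma) VALUES (?, ?, ?, ?)",
        [(i, w, p, l) for i, (w, p, l) in enumerate(tokens)],
    )
    conn.commit()
    return conn


def find_by_lemma(conn, lemma, pos_prefix=""):
    """Return (position, word) pairs for a lemma, optionally restricted by a
    PoS tag prefix. The prefix is used in a LIKE pattern, so it should not
    contain the wildcard characters % or _."""
    rows = conn.execute(
        "SELECT position, word FROM token "
        "WHERE lemma = ? AND pos LIKE ? ORDER BY position",
        (lemma, pos_prefix + "%"),
    )
    return rows.fetchall()
```

A client program would wrap a call such as find_by_lemma(conn, "corpus", "NN")
behind its user interface, so that the linguist never has to write the SQL
statement itself.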
Regards
Thomas
---
Thomas Kuenneth M.A. Universitaet Erlangen-Nuernberg
Institut fuer Germanistik Abteilung Computerlinguistik
Bismarckstr. 6 * D-91054 Erlangen * Tel.: +49 9131 8529250
http://www.linguistik.uni-erlangen.de/~tommi