[Corpora-List] question about storage of corpora
Milos Jakubicek
jak at fi.muni.cz
Thu Jun 2 08:13:25 UTC 2011
On 30.5.2011 15:41, Mark Davies wrote:
>>> So, why bother and store all that in relational DBs? The current
>>> XML-DBs are quite efficient and fast
>
> I'm not trying to be contentious -- just wondering. My sense has been
> that XML works fine for small and medium-sized corpora, but that with
> larger corpora (e.g. 100 million words or more), it's not overly
> efficient or fast. Although I don't use IMS CW / CQP and I don't know
> much about the internal architecture of CQPweb or related
> architectures like Sketch Engine, my understanding is that the
> underlying format for these approaches uses relational databases (and
> needs to, because of the corpus size).
As for Sketch Engine, we of course do NOT use any relational databases.
The underlying architecture is solely the Manatee system.
For the input format we use the simplest possible option, tab-separated
vertical text, as described here:
http://trac.sketchengine.co.uk/wiki/SkE/PreparingCorpus
There are good reasons for this choice, e.g. data size (as pointed out
by Adam Radziszewski) and the fact that such a simple format can be
manipulated very easily using standard Unix tools.
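
To illustrate how little machinery such a format needs, here is a minimal
sketch in Python. It assumes one token per line, tab-separated attributes
in the order word, tag, lemma, and structures such as <doc> and <s> on
lines of their own. The column order, the sample data and the function
name read_vertical are assumptions made for this example only; they are
not taken from the Sketch Engine documentation or from Manatee.

```python
from collections import Counter

# Hypothetical sample in vertical format. Assumed columns: word, tag, lemma.
SAMPLE = """<doc id="example">
<s>
The\tDT\tthe
cats\tNNS\tcat
sleep\tVBP\tsleep
.\t.\t.
</s>
<s>
A\tDT\ta
cat\tNN\tcat
sleeps\tVBZ\tsleep
.\t.\t.
</s>
</doc>
"""


def read_vertical(text):
    """Yield sentences as lists of (word, tag, lemma) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Simplification: any line of the form <...> is treated as structure
        # markup, not as a token.
        if line.startswith("<") and line.endswith(">"):
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(tuple(line.split("\t")))
    if sentence:
        # Tokens after the last </s>, if any, form a final sentence.
        yield sentence


sentences = list(read_vertical(SAMPLE))

# Lemma frequency list built from the third column.
lemma_counts = Counter(tok[2] for sent in sentences for tok in sent)

for lemma, freq in lemma_counts.most_common():
    print(f"{freq}\t{lemma}")
```

On the command line, a comparable frequency list can usually be had by
filtering out the structure lines and piping the third column through
sort and uniq -c, which is the kind of manipulation meant above.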
Regards,
Milos Jakubicek