[Corpora-List] question about storage of corpora

Milos Jakubicek jak at fi.muni.cz
Thu Jun 2 08:13:25 UTC 2011


On 30.5.2011 15:41, Mark Davies wrote:
>>> So, why bother and store all that in relational DBs? The current
>>> XML-DBs are quite efficient and fast
>
> I'm not trying to be contentious -- just wondering. My sense has been
> that XML works fine for small and medium-sized corpora, but that with
> larger corpora (e.g. 100 million words or more), it's not overly
> efficient or fast. Although I don't use IMS CW / CQP and I don't know
> much about the internal architecture of CQPweb or related
> architectures like Sketch Engine, my understanding is that the
> underlying format for these approaches uses relational databases (and
> needs to, because of the corpus size).

As for Sketch Engine, we of course do NOT use any relational databases.
The underlying architecture is solely the Manatee system.

For the underlying input format we use the simplest tab-separated
vertical text, as described here:
http://trac.sketchengine.co.uk/wiki/SkE/PreparingCorpus and there are
good reasons for it (e.g. data size, as pointed out by Adam
Radziszewski, or the fact that such a simple format can be so easily
manipulated using standard Unix tools).
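
To illustrate how little machinery such a format needs, here is a minimal
sketch in Python. It assumes the usual vertical layout: one token per line,
attributes separated by tabs, and structural tags such as <doc> or <s> on
lines of their own. The sample data, the column order (word, tag, lemma) and
the helper names iter_tokens and attribute_counts are made up for this
example; they are not part of Manatee or Sketch Engine.

```python
from collections import Counter

# Hypothetical sample in vertical format: word, tag, lemma (tab-separated).
SAMPLE = (
    '<doc id="example">\n'
    "<s>\n"
    "The\tDT\tthe\n"
    "cats\tNNS\tcat\n"
    "sleep\tVBP\tsleep\n"
    "</s>\n"
    "</doc>\n"
)


def iter_tokens(lines):
    """Yield the attribute list of every token line, skipping structure lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: a structure line looks like a tag and contains no tab.
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            continue
        yield line.split("\t")


def attribute_counts(lines, column):
    """Count the values found in one attribute column (0-based index)."""
    return Counter(
        fields[column] for fields in iter_tokens(lines) if len(fields) > column
    )


# Frequency list of the third column (the lemma in this sample).
lemma_counts = attribute_counts(SAMPLE.splitlines(), 2)
```

For a real corpus the same functions could be fed an open file object instead
of SAMPLE.splitlines(), so the text is streamed line by line rather than
loaded into memory at once.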

Regards,
Milos Jakubicek

