[Corpora-List] question about storage of corpora

Mon May 30 13:41:42 UTC 2011

>> So, why bother and store all that in relational DBs? The current XML-DBs are quite efficient and fast

I'm not trying to be contentious -- just wondering. My sense has been that XML works fine for small and medium-sized corpora, but that with larger corpora (e.g. 100 million words or more), it's not particularly efficient or fast. Although I don't use IMS CW / CQP and I don't know much about the internal architecture of CQPweb or related architectures like Sketch Engine, my understanding is that the underlying format for these approaches is a relational database (and needs to be, because of the corpus size). I know that the architecture for my corpora (http://corpus.byu.edu/architecture.asp) uses relational databases, and it seems to be quite scalable for large corpora, e.g. 400 million words or more.
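
To make the relational approach concrete, here is a minimal sketch, assuming a one-row-per-token table with an index on the word form. The table and column names (tokens, pos_id, word), the helper functions, and the use of SQLite are all illustrative assumptions, not the actual schema behind corpus.byu.edu or any of the other systems mentioned above.

```python
import sqlite3


def build_corpus(words):
    """Load a list of word forms into an in-memory, one-row-per-token table.

    Hypothetical schema for illustration only: pos_id is the running
    token position in the corpus, word is the surface form.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens (pos_id INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO tokens (pos_id, word) VALUES (?, ?)",
        list(enumerate(words)),
    )
    # The index on word is what lets lookups avoid scanning every token.
    conn.execute("CREATE INDEX idx_word ON tokens (word)")
    conn.commit()
    return conn


def next_word_counts(conn, word):
    """Count the words immediately following `word`.

    Uses a self-join on adjacent token positions, so the query is
    plain SQL rather than a traversal of an XML tree.
    """
    return conn.execute(
        """
        SELECT b.word, COUNT(*) AS freq
        FROM tokens AS a
        JOIN tokens AS b ON b.pos_id = a.pos_id + 1
        WHERE a.word = ?
        GROUP BY b.word
        ORDER BY freq DESC, b.word ASC
        """,
        (word,),
    ).fetchall()


conn = build_corpus("the cat sat on the mat and the cat slept".split())
print(next_word_counts(conn, "the"))
```

On this toy input the query returns the right-hand collocates of "the" with their frequencies. Whether the same design holds up at 100 million words or more depends on the database engine and indexing, which this sketch does not test.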

So in terms of the scalability of XML, what size are the corpora that you're working with? Has anyone been able to get XML working well with large corpora (e.g. 100 million words or more)? If so, are any of these publicly available via a web interface? It would be nice to take a look.

Thanks in advance,

Mark Davies 

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================