[Corpora-List] question about storage of corpora

Alberto Simões albie at alfarrabio.di.uminho.pt
Mon May 30 16:01:04 UTC 2011


Hello

Neither IMS-CWB nor Open-CWB uses any kind of relational database underneath 
(CQPweb uses Open-CWB as its backend, so it doesn't use a relational 
database either). The CWB format is very compact and efficient for storing 
annotated corpora.
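
To give a rough idea of why such a format can be compact: as far as I 
understand it, CWB keeps each token-level attribute as a lexicon of distinct 
values, a stream of integer IDs (one per corpus position), and a reverse 
index from each ID to its positions. The Python below is only a toy sketch 
of that idea. The names `encode` and `positions` are made up for 
illustration, and this is not CWB's real on-disk format or API.

```python
from collections import defaultdict


def encode(tokens):
    """Toy encoding: lexicon, integer ID stream, and reverse index.

    Illustrative only; not CWB's actual file layout.
    """
    lexicon = {}               # word form -> integer ID
    stream = []                # one integer ID per corpus position
    index = defaultdict(list)  # integer ID -> ascending corpus positions
    for position, token in enumerate(tokens):
        # A new word form gets the next free ID; a known one reuses its ID.
        token_id = lexicon.setdefault(token, len(lexicon))
        stream.append(token_id)
        index[token_id].append(position)
    return lexicon, stream, dict(index)


def positions(word, lexicon, index):
    """Return every corpus position of `word` without scanning the text."""
    token_id = lexicon.get(word)
    return [] if token_id is None else index[token_id]


tokens = "the cat saw the dog".split()
lexicon, stream, index = encode(tokens)
print(stream)                            # [0, 1, 2, 0, 3]
print(positions("the", lexicon, index))  # [0, 3]
```

Each word form is stored once, the text itself becomes a flat array of 
integers, and a lookup is a dictionary access rather than a table join.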

Note that XML databases and relational databases are built to be generic. 
Generic is good, but a generic engine is usually less efficient than a 
format designed for one specific task.

I am using CWB for really big corpora and am quite happy with its speed, 
both for querying and for encoding corpora.

Cheers
ambs

On 30/05/2011 14:41, Mark Davies wrote:
>>> So, why bother and store all that in relational DBs? The current XML-DBs are quite efficient and fast
>
> I'm not trying to be contentious -- just wondering. My sense has been that XML works fine for small and medium-sized corpora, but that with larger corpora (e.g. 100 million words or more), it's not overly efficient or fast. Although I don't use IMS CW / CQP and I don't know much about the internal architecture of CQPweb or related architectures like Sketch Engine, my understanding is that the underlying format for these approaches uses relational databases (and needs to, because of the corpus size). I know that the architecture for my corpora (http://corpus.byu.edu/architecture.asp) uses relational databases, and it seems to be quite scalable for large corpora, e.g. 400 million words or more.
>
> So in terms of the scalability of XML, what size are the corpora that you're working with? Has anyone been able to get XML working well with large corpora (e.g. 100 million words or more)? If so, are any of these publicly-available, via a web interface -- it would be nice to take a look.
>
> Thanks in advance,
>
> Mark Davies
>
> ============================================
> Mark Davies
> Professor of (Corpus) Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> Web: http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

-- 
Alberto Simoes
CCTC-UM / CEHUM



