[Corpora-List] question about storage of corpora

Mon May 30 17:00:55 UTC 2011

Hi all,

I suppose, if you looked at it the right way and squinted a bit, you could call the CWB index format a relational database if you wanted to, since basic text consists of "rows" (tokens) with "columns"/"fields" (the words and word-level annotations or "positional attributes" in official CWB jargon). But the implementation is very different; in a relational database the contents of a "row" are stored together within one "table" whereas in CWB each positional attribute is encoded as an independent token stream, and there is in addition a separate means of representing XML-style ranges within the corpus (or "structural attributes").
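To make the contrast concrete, here is a minimal sketch of the two layouts. The toy tokens, the attribute names (word, pos, lemma) and the dictionary structures are assumptions for illustration only. The real CWB index is a binary format and is not reproduced here.

```python
# Hypothetical toy corpus: three tokens, each with word, POS and lemma.
rows = [
    ("The", "DT", "the"),
    ("cats", "NNS", "cat"),
    ("sleep", "VBP", "sleep"),
]

# Row-oriented layout (relational style): the fields of one token are
# stored together, as in `rows` above.

# Attribute-oriented layout (CWB style, greatly simplified): every
# positional attribute is its own independent stream, indexed by
# corpus position.
attribute_names = ("word", "pos", "lemma")
attributes = {
    name: [row[i] for row in rows]
    for i, name in enumerate(attribute_names)
}

# Structural attributes (XML-style ranges) are kept separately, here as
# inclusive (start, end) pairs of corpus positions.
structures = {"s": [(0, 2)]}

# Reading one attribute never touches the others.
print(attributes["pos"])
```

Keeping each attribute in its own stream means a query that only looks at part-of-speech tags reads nothing but that one stream.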

The other major difference with CWB is that rows in a relational database are unordered in principle, so if you want to represent an ordered sequence of tokens you would normally provide an integer index in each row. But data in the CWB format is implicitly ordered, which is one of the reasons CWB's Corpus Query Processor (CQP) can do certain kinds of complex searches (e.g. regular expressions at the word level) very quickly.
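As a rough illustration of why implicit ordering helps, the sketch below matches a sequence of per-token patterns using position arithmetic alone: the token after position i is simply position i + 1. The function name match_sequence and the toy data are hypothetical, and this naive scan is not how CQP is implemented; it only shows that no explicit sequence column is needed.

```python
import re

# Hypothetical attribute streams, aligned by corpus position.
streams = {
    "word": ["the", "big", "dog", "saw", "a", "small", "cat"],
    "pos": ["DT", "JJ", "NN", "VBD", "DT", "JJ", "NN"],
}

def match_sequence(patterns, streams):
    """Find runs of consecutive tokens matching the given patterns.

    patterns: list of (attribute_name, regex), one entry per token.
    Returns inclusive (start, end) corpus positions of each match.
    """
    compiled = [(attr, re.compile(rx)) for attr, rx in patterns]
    n_tokens = len(next(iter(streams.values())))
    hits = []
    for start in range(n_tokens - len(compiled) + 1):
        # Adjacency is just start + offset: no join or sort is needed.
        if all(
            rx.fullmatch(streams[attr][start + offset])
            for offset, (attr, rx) in enumerate(compiled)
        ):
            hits.append((start, start + len(compiled) - 1))
    return hits

# An adjective followed by any noun tag.
hits = match_sequence([("pos", "JJ"), ("pos", "NN.*")], streams)
print(hits)
```

In a relational table the same query would typically need a self-join on an explicit position column.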

On the more general issue of whether XML is a "good format" - this is a very ambiguous question because you need to ask "good for what?" The requirements of (a) archiving and interchange and (b) searching and processing are very different. Also, we need to be clear about whether by "using XML" we mean "just the raw XML files" or whether we mean "the XML files as processed by an XML indexer/database", which are again different things.

For archiving and data interchange, XML (the file format) is unsurpassed for any size of corpus, as it is software-independent, based on plaintext, and the tags are human-readable and to some degree self-describing. A CWB index or a database would *not* be a good format for this purpose, by contrast, because they are binary formats based on non-self-describing column-and-row input.

However, for searching/processing, handling raw XML files directly is liable to be too slow, as Mark pointed out. (The same is true of plaintext, of course.) The only way round this is to index. Now you can use an XML indexer/query tool, like Xaira, or an XML DB system; or you can pre-process the XML data to a columnar format and import that into CWB or a carefully-optimised relational database system; one way or another, however, you MUST index, or you are just not going to search that billion-word corpus in reasonable time. Unless you have a supercomputer or a large chunk of Grid power handy, that is!
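The point about indexing can be sketched with a simple inverted index, which maps each word form to the corpus positions where it occurs. This is a generic technique shown under assumed names (build_index, a whitespace-tokenised toy corpus), not the actual index structure of CWB, Xaira or any particular database.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each token to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return dict(index)

# Hypothetical toy corpus; a real one would have millions of tokens.
tokens = "the cat sat on the mat".split()

# Built once, up front: a single pass over the whole corpus.
index = build_index(tokens)

# Each later lookup goes straight to the hits instead of rescanning
# the entire token sequence.
print(index.get("the", []))
print(index.get("dog", []))
```

The cost of the full pass is paid once at indexing time, which is what makes repeated searches over a very large corpus feasible.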

In sum, if it was a corpus I'd built, I would create the "master copy" in XML, and that would be the version I would distribute and set aside for preservation; but I would also generate from that a copy in vertical format and (thus) a set of CWB indexes that I would use for actual analysis. That's basically the same approach Mario described...
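A minimal sketch of the XML-to-vertical step might look like the following. The XML schema here (text, s and w elements, with pos and lemma attributes on w) is an assumption made up for the example; a real corpus would need its own mapping. The output follows the usual vertical convention of one token per line with tab-separated annotations, and structural tags on lines of their own.

```python
import xml.etree.ElementTree as ET

# Hypothetical master-copy XML; the element and attribute names are
# illustrative only.
SAMPLE = (
    '<text id="t1"><s>'
    '<w pos="DT" lemma="the">The</w>'
    '<w pos="NN" lemma="cat">cat</w>'
    "</s></text>"
)

def to_vertical(elem, lines=None):
    """Flatten an element tree into a list of vertical-format lines."""
    if lines is None:
        lines = []
    if elem.tag == "w":
        # One token per line: word form, then its annotations.
        fields = [elem.text or "", elem.get("pos", ""), elem.get("lemma", "")]
        lines.append("\t".join(fields))
        return lines
    # Any other element becomes a structural range: an opening tag line,
    # its children, then a closing tag line. Attribute values are not
    # escaped in this sketch.
    attrs = "".join(f' {key}="{value}"' for key, value in elem.attrib.items())
    lines.append(f"<{elem.tag}{attrs}>")
    for child in elem:
        to_vertical(child, lines)
    lines.append(f"</{elem.tag}>")
    return lines

vertical_lines = to_vertical(ET.fromstring(SAMPLE))
print("\n".join(vertical_lines))
```

The XML stays the master copy; the vertical file is a derived, disposable artefact that can be regenerated whenever the XML changes.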

Just my 2p!

Best

Andrew.

Andrew Hardie
Linguistics & English Language
County South
Lancaster University
Lancaster LA1 4YL
United Kingdom

http://www.ling.lancs.ac.uk/staff/hardie

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] 
> On Behalf Of Alberto Simões
> Sent: 30 May 2011 17:01
> To: corpora at uib.no
> Subject: Re: [Corpora-List] question about storage of corpora
> 
> Hello
> 
> IMS-CWB or Open-CWB don't use any kind of relational database
> underneath (also, CQPweb uses Open-CWB as its backend, so it
> doesn't use relational databases either). The format is very
> compact and efficient for storing annotated corpora.
> 
> Note that XML databases or relational databases are built to 
> be generic. 
> Generic is good, but generic is less powerful.
> 
> I am using CWB for really big corpora and am quite happy with
> its speed, both in querying and in encoding corpora.
> 
> Cheers
> ambs
> 
> On 30/05/2011 14:41, Mark Davies wrote:
> >>> So, why bother and store all that in relational DBs? The current 
> >>> XML-DBs are quite efficient and fast
> >
> > I'm not trying to be contentious -- just wondering. My
> > sense has been that XML works fine for small and medium-sized
> > corpora, but that with larger corpora (e.g. 100 million words
> > or more), it's not overly efficient or fast. Although I don't
> > use IMS CWB / CQP and I don't know much about the internal
> > architecture of CQPweb or related architectures like Sketch
> > Engine, my understanding is that the underlying format for
> > these approaches uses relational databases (and needs to,
> > because of the corpus size). I know that the architecture for
> > my corpora (http://corpus.byu.edu/architecture.asp) uses
> > relational databases, and it seems to be quite scalable for
> > large corpora, e.g. 400 million words or more.
> >
> > So in terms of the scalability of XML, what size are the
> > corpora that you're working with? Has anyone been able to get
> > XML working well with large corpora (e.g. 100 million words
> > or more)? If so, are any of these publicly-available, via a
> > web interface -- it would be nice to take a look.
> >
> > Thanks in advance,
> >
> > Mark Davies
> >
> > ============================================
> > Mark Davies
> > Professor of (Corpus) Linguistics
> > Brigham Young University
> > (phone) 801-422-9168 / (fax) 801-422-0906
> > Web: http://davies-linguistics.byu.edu
> >
> > ** Corpus design and use // Linguistic databases **
> > ** Historical linguistics // Language variation **
> > ** English, Spanish, and Portuguese ** 
> > ============================================
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> 
> --
> Alberto Simoes
> CCTC-UM / CEHUM