[Corpora-List] question about storage of corpora

Wed Jun 1 15:49:58 UTC 2011

Sorry Tine, and the others,

here are just a few comments on some of the recent arguments related to your posting and some of the replies:

Space and XML as storage:
Well, I just bought a fast 2 TB SATA disk for around 100 Euro for my private purposes, to extend my existing 1.5 TB and 1 TB disks and to back them up. A DB also takes space, and it is not true that the space is reduced to almost nothing, just to maybe one Xth (is it 1/3 or 1/4 in your cases?). Whether it is 240 GB or 40 GB, I don't see the need to put time and effort into mapping the XML to a RelDB just to save some space.
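
To put rough numbers on that, here is a back-of-the-envelope calculation using only the figures above (a 2 TB disk for around 100 Euro, 240 GB versus 40 GB). It assumes 1 TB = 1000 GB and ignores backups and redundancy; the function name is made up for illustration.

```python
def storage_cost_euro(size_gb, disk_price_euro=100, disk_capacity_gb=2000):
    """Price of the disk space that size_gb occupies, at the disk's price per GB."""
    return disk_price_euro * size_gb / disk_capacity_gb


xml_cost = storage_cost_euro(240)  # raw XML at 240 GB
db_cost = storage_cost_euro(40)    # compacted storage at 40 GB
saving = xml_cost - db_cost        # what the mapping effort would buy back
```

At these prices the 240 GB come to about 12 Euro of disk and the 40 GB to about 2 Euro, so the mapping work would save roughly 10 Euro of hardware.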

As for what I mentioned about the CLC: it is raw TEI P5 XML, and I do not need to store it in a DB to get good performance, as the online interface at http://riznica.ihjj.hr/ shows. Choose any of the subcorpora; the "complete" one should be over 100 mil. tokens. Rendering the results most of the time involves an XSLT call on the raw XML data to create the HTML view. The documents are raw TEI P5 XML files on the server, and the rendering to HTML is done with every request, without our server contaminating Zagreb with smoke. You'll probably wait for the connection, not for the server to do the job (except for the collocation analyses in the extended menu). And there is no relational DB that I need to set up and maintain, just a storage folder for the XML and a generated binary index.
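
The "folder of XML plus a generated binary index" idea can be sketched roughly as follows. This is not the CLC implementation, only a minimal illustration under several assumptions: the files use the TEI P5 namespace, the running text sits in <p> elements, tokens are split on whitespace, and the index is simply pickled to disk. All function names are hypothetical.

```python
import pickle
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

TEI_NS = "{http://www.tei-c.org/ns/1.0}"  # TEI P5 namespace


def build_index(folder):
    """Map each lowercased token to a list of (file name, paragraph number)."""
    index = defaultdict(list)
    for path in sorted(Path(folder).glob("*.xml")):
        root = ET.parse(path).getroot()
        for number, paragraph in enumerate(root.iter(TEI_NS + "p")):
            text = "".join(paragraph.itertext())
            # Assumption: naive whitespace tokenization, one entry per paragraph.
            for token in set(text.lower().split()):
                index[token].append((path.name, number))
    return dict(index)


def save_index(index, index_path):
    """Write the index as a binary file next to (or apart from) the XML."""
    with open(index_path, "wb") as handle:
        pickle.dump(index, handle)


def load_index(index_path):
    """Read the binary index back; the XML files themselves stay untouched."""
    with open(index_path, "rb") as handle:
        return pickle.load(handle)


def lookup(index, token):
    """Return the (file name, paragraph number) hits for a token."""
    return index.get(token.lower(), [])
```

A search then only opens the few XML files named in the hits, and those are the files that would be passed on to the XSLT rendering step.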

Speed:
Observing a decrease in speed as the data grows, in any DB and for any type of data storage, is usually a sign of poor engineering and/or poor hardware. XML DBs and other DBs do not differ there: if you index a field (XML attributes, full text, tags), the search goes through, for example, a hash function and should be as fast as your hash function is. This is true for any DB, relational, XML-based, etc. I cannot imagine that access to binary DB tables in a RelDB should be significantly faster than direct access through the operating system's file I/O to XML files on a disk somewhere (putting my full faith in current OSes and their good handling of file I/O and cache management, which is even true for Linux nowadays).
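
The point about indexed fields can be illustrated with a toy example. It assumes the tokens are already in memory as dictionaries with made-up "form" and "lemma" attributes, and it uses a Python dict as the hash index; a real DB index is more elaborate, but the principle is the same.

```python
from collections import defaultdict


def build_attribute_index(tokens, attribute):
    """Hash index: attribute value -> positions of the tokens carrying it."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token[attribute]].append(position)
    return dict(index)


def scan(tokens, attribute, value):
    """Unindexed search: looks at every token, so it slows down as data grows."""
    return [i for i, token in enumerate(tokens) if token[attribute] == value]


def indexed(index, value):
    """Indexed search: one hash lookup, whatever the size of the data."""
    return index.get(value, [])
```

Both searches return the same positions; the difference is that scan() touches every token while indexed() does a single dictionary probe.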

Evaluation:
If you want to test large sizes and speed, just put BaseX on your desktop (no complicated installation procedures), fire it up, create a new database from your millions of XML files in some folder, set up the indexes the right way, enjoy the power of XQuery, and measure the performance. I cannot do that for our Polish colleagues with more than a billion; my corpus is just around 100 mil. tokens. I am seriously working on a way to extend the existing CLC interface with new functionality that makes use of XQuery and maybe BaseX, without any intention to touch RelDBs for the corpus work any time soon.
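
For the measuring part, a small timing harness along these lines might be enough. It is only a sketch: query stands for any callable that sends one query to whatever system is under test (BaseX, a RelDB, plain files) and returns its result; how that callable talks to BaseX is deliberately left out here, and the name measure is made up.

```python
import statistics
import time


def measure(query, repeats=5):
    """Run query() several times; return (median seconds, last result)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = query()
        timings.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to a cold cache on the first run.
    return statistics.median(timings), result
```

Running the same query against databases of different sizes and comparing the medians shows whether the response time really stays flat as the corpus grows.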

ciao
DC

--
Dr. Damir Cavar
http://web.me.com/dcavar/
mobile +49 176 60928748
office +49 7531 885357
private (US): +1 (734) 330-2902
FaceTime: dcavar at me.com
