[Corpora-List] question about storage of corpora

Damir Cavar dcavar at indiana.edu
Mon May 30 14:21:48 UTC 2011


Hi Mark et al.,

On May 30, 2011, at 3:41 PM, Mark Davies wrote:

>>> So, why bother and store all that in relational DBs? The current XML-DBs are quite efficient and fast
> 
> So in terms of the scalability of XML, what size are the corpora that you're working with? Has anyone been able to get XML working well with large corpora (e.g. 100 million words or more)? If so, are any of these publicly-available, via a web interface -- it would be nice to take a look.

Yes, more than 100 million tokens is not a big deal. I know that some time ago the XML DBs were slow, but apparently this is no longer the case. Once any type of data is indexed well, its size does not actually matter that much anymore.

The Philologic interface that you get here:
http://riznica.ihjj.hr/philologic/Cijeli.whizbang.form.en.html
is over a large collection (100 million or so) of pure TEI P5 XML, although the annotation only goes down to the <p> level. The only slow part is index generation; online access through the index should not vary much with the size of the corpus, no? In practice, online access with the index is as fast as can be. It would not matter much if we indexed <w> elements and annotation attributes such as lemma as well.
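
To illustrate the point about indexing, here is a minimal Python sketch, not how Philologic or any XML DB actually does it. It assumes a tiny made-up sample with <w> elements carrying a lemma attribute and an n attribute on <p>, and leaves out the TEI namespace for brevity. The function names build_index and lookup are hypothetical. Building the index takes one pass over the whole corpus (the slow part); a query afterwards is a single dictionary lookup.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up TEI-like sample; real TEI P5 documents use the TEI namespace,
# which is omitted here to keep the sketch short.
SAMPLE = """<text>
  <p n="1"><w lemma="corpus">corpora</w> <w lemma="be">are</w> <w lemma="large">large</w></p>
  <p n="2"><w lemma="index">indexes</w> <w lemma="be">are</w> <w lemma="fast">fast</w></p>
</text>"""


def build_index(xml_text):
    """One pass over the corpus: map each lemma to (paragraph id, token position) pairs."""
    index = defaultdict(list)
    root = ET.fromstring(xml_text)
    for p in root.iter("p"):
        for pos, w in enumerate(p.iter("w")):
            index[w.get("lemma")].append((p.get("n"), pos))
    return dict(index)


def lookup(index, lemma):
    """Dictionary lookup: the work grows with the number of hits, not with corpus size."""
    return index.get(lemma, [])


index = build_index(SAMPLE)
print(lookup(index, "be"))  # [('1', 1), ('2', 1)]
```

With a real collection the index would live on disk rather than in memory, but the division of labour is the same: expensive generation once, cheap access afterwards.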

BaseX is also supposed to be fast with large collections of XML files. The developers are here; shall we try to convince them to run an evaluation with 1-billion-token corpora? Maybe a comparison of the current XML DBs could be accepted as a paper somewhere. :-)

best wishes
DC


--
Dr. Damir Cavar
http://web.me.com/dcavar/
mobile +49 176 60928748
office +49 7531 885357
private (US): +1 (734) 330-2902
FaceTime: dcavar at me.com


