[Corpora-List] Corpus Development

Mark Davies Mark_Davies at byu.edu
Mon Apr 28 15:03:07 UTC 2008


>> This is also true of Xaira, of eXist, and many other XML-based systems.
>> They used specialised indexing and storage techniques optimised for
>> handling large quantities of text,

Yes, this is what I was referring to when I mentioned that nearly all XML-based architectures use a hybrid approach, where the speed (if it's there) is due to indexes. I've heard some people (especially those who just have small 5-10 million word corpora, or those who haven't actually tried it with large corpora) suggest that you can search through the XML files themselves, but that's probably prohibitive with large corpora. Hence the need for "specialized indexes", as you've pointed out.

>> rather than the specialized indexing
>> and storage techniques used by relational systems which are optimised
>> for handling large numbers of, er, relations. It's true that you can
>> translate (with some loss of information) text into relations, but that
>> doesn't mean you *have* to do so to get your text efficiently processed.

Actually, there's a lot to be gained from using relational databases, even when the number of tables is very small, and where there are relatively few "relations". Many RDBMSs allow for clustered indexes (and clustered indexes on alternative views of the table(s)), which really speed up searches. In addition, the algorithms to process (the equivalent of) hashes are highly optimized in these systems -- often more so than in systems that use proprietary schemes.

For example, suppose that a large XML database has "specialized indexes" for each word in the corpus, with "offset values" for each occurrence of that word in the corpus, as per:

"Managing Gigabytes: Compressing and Indexing Documents and Images"
By Ian H. Witten, Alistair Moffat, Timothy C. Bell, 1999, Morgan Kaufmann

So for the BNC, for example, there would be about 3,700,000 entries for all of the occurrences of the lemma 'be'. Now suppose the user wants to find the most frequent bigrams of '[be] [aj*]' (is sick, was tired, are green, etc). As I understand the indexing scheme of non-relational database architectures, it will load the index X for '[be]' (3,700,000 entries) and the index Y for [aj*] (about 7,300,000 entries). It will then do a huge hash operation on these two lists, keeping the pairs where the offset value in X is one less than the offset value in Y. Problem is, these hash operations -- in many proprietary systems -- are S--L--O--W. In a decent RDBMS (Oracle, MySQL, SQL Server), it tends to be much faster. For example, in the BYU-BNC (http://corpus.byu.edu/bnc), the query takes about three seconds.
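
To make the offset-matching step concrete, here is a minimal Python sketch of it on a toy corpus. Everything in it (the token list, the tag names, the helper functions build_postings and adjacent_bigrams) is made up for illustration; it is not the code of any particular XML or relational system, and it says nothing about speed at BNC scale.

```python
from collections import Counter

# Toy corpus as (word, lemma, pos_tag) triples. The offset of a token is
# simply its position in this list. All data here is invented.
TOKENS = [
    ("she", "she", "pn"), ("is", "be", "vb"), ("sick", "sick", "aj0"),
    ("and", "and", "cj"), ("he", "he", "pn"), ("was", "be", "vb"),
    ("tired", "tired", "aj0"), ("they", "they", "pn"), ("are", "be", "vb"),
    ("here", "here", "av0"), ("it", "it", "pn"), ("is", "be", "vb"),
    ("sick", "sick", "aj0"),
]


def build_postings(tokens, key):
    """Map each key value (lemma or tag) to the list of offsets where it occurs."""
    postings = {}
    for offset, token in enumerate(tokens):
        postings.setdefault(key(token), []).append(offset)
    return postings


def adjacent_bigrams(tokens, x_offsets, y_offsets):
    """Count word pairs where an X offset is exactly one less than a Y offset."""
    y_set = set(y_offsets)  # hash the second list for constant-time lookups
    counts = Counter()
    for x in x_offsets:
        if x + 1 in y_set:
            counts[(tokens[x][0], tokens[x + 1][0])] += 1
    return counts


lemma_index = build_postings(TOKENS, lambda t: t[1])
tag_index = build_postings(TOKENS, lambda t: t[2])

x = lemma_index["be"]  # index X: all offsets of the lemma 'be'
y = [o for tag, offs in tag_index.items() if tag.startswith("aj") for o in offs]  # index Y: [aj*]

result = adjacent_bigrams(TOKENS, x, y)
print(result.most_common())
```

On real data, X and Y would each hold millions of offsets, so the cost of this one step dominates the query.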

And for the best speed, of course, you wouldn't be doing a JOIN / hash operation on two large indexes, anyway. You'd be storing contextual information as additional columns within the table with the clustered index.
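
A minimal sketch of that layout, using Python's built-in sqlite3 module purely as a stand-in for a full RDBMS. The table and column names (tokens, next_word, next_pos, token_id) are hypothetical, and this is not the actual BYU-BNC schema. A SQLite WITHOUT ROWID table is stored in primary-key order, which is used here as a rough analogue of a clustered index.

```python
import sqlite3

# Toy corpus as (word, lemma, pos_tag) triples; all data is invented.
TOKENS = [
    ("she", "she", "pn"), ("is", "be", "vb"), ("sick", "sick", "aj0"),
    ("and", "and", "cj"), ("he", "he", "pn"), ("was", "be", "vb"),
    ("tired", "tired", "aj0"), ("they", "they", "pn"), ("are", "be", "vb"),
    ("here", "here", "av0"), ("it", "it", "pn"), ("is", "be", "vb"),
    ("sick", "sick", "aj0"),
]

conn = sqlite3.connect(":memory:")

# Each row carries its right-hand context (next_word, next_pos) as extra
# columns, and the table is physically ordered by (lemma, next_pos, token_id).
conn.execute("""
    CREATE TABLE tokens (
        lemma     TEXT NOT NULL,
        next_pos  TEXT NOT NULL,
        token_id  INTEGER NOT NULL,
        word      TEXT NOT NULL,
        next_word TEXT NOT NULL,
        PRIMARY KEY (lemma, next_pos, token_id)
    ) WITHOUT ROWID
""")

rows_to_insert = []
for i, (word, lemma, tag) in enumerate(TOKENS):
    # The last token has no right-hand neighbour; store empty strings.
    next_word, _, next_tag = TOKENS[i + 1] if i + 1 < len(TOKENS) else ("", "", "")
    rows_to_insert.append((lemma, next_tag, i, word, next_word))
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows_to_insert)

# '[be] [aj*]' becomes a single-table filter plus GROUP BY: no JOIN needed.
rows = conn.execute("""
    SELECT word, next_word, COUNT(*) AS freq
    FROM tokens
    WHERE lemma = 'be' AND next_pos LIKE 'aj%'
    GROUP BY word, next_word
    ORDER BY freq DESC, word
""").fetchall()
print(rows)
```

The trade-off is storage: every row repeats information about its neighbour, in exchange for answering the bigram query from one contiguous range of the table.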

Best,

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


