[Corpora-List] Corpus Development

Mark Davies Mark_Davies at byu.edu
Mon Apr 28 15:46:39 UTC 2008


Bob,

>> I'd be interested in hearing more about your
views on the differences - pro and con - between
relational DB architecture and full text DB
architectures.

The only "full text DB architecture" I've worked with very much is the "Full Text" Indexing in SQL Server (http://msdn2.microsoft.com/en-us/library/ms142571.aspx). Are we talking about the same thing?

In SQL Server, Full-Text Indexing provides for lightning-fast searches of very large corpora. For example, in a 700 million word corpus of historical English texts (1500s-1800s) that I've created, I can search for a specific word or phrase, and it takes less than half a second to find all occurrences (even for phrases with high-frequency words, like 'could be' or 'might know'). The problem is that it only works (well) for exact words and exact phrases. For an annotated corpus (POS, lemma, etc.) it's pretty worthless. That's why corpora like CORDE and CREA, which use this architecture (e.g. http://corpus.rae.es/creanet.html), are so limited -- no POS, no lemma, essentially no substring searches, and no prospect of adding any of these either.

I don't use Oracle, but I know that Full-Text searches there are quite a bit more powerful than with SQL Server (see http://www.oracle.com/technology/pub/articles/asplund-textsearch.html). Apparently the same is true of MySQL (see http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html).

But it seems to me that eventually, Full-Text searches with any of these products will run into a wall. Imagine a corpus where every word has at least a POS tag and a lemma (not to mention other types of annotation). Even with decent RegEx functions (as I think Oracle has), the queries are going to get awfully messy, and possibly sluggish as well (with large corpora).
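To make the "messy" point concrete, here is a minimal sketch in Python of a RegEx query over inline-annotated text. The word/POS/lemma token format, the sample sentence, and the tag names are all assumptions made up for illustration; they are not taken from any of the corpora or products mentioned above.

```python
import re

# Hypothetical inline annotation: each token is word/POS/lemma, space-separated.
tagged = (
    "she/PRP/she could/MD/can be/VB/be late/JJ/late ,/,/, "
    "and/CC/and he/PRP/he might/MD/may know/VB/know"
)

# Query: any form of the lemma "can", followed by any verb (POS starting with VB).
# Even this simple two-word query has to spell out the delimiters for every slot.
pattern = re.compile(r"(?<!\S)(\S+)/MD/can (\S+)/VB\w*/\S+")

# Collect (modal form, verb form) pairs.
matches = [(m.group(1), m.group(2)) for m in pattern.finditer(tagged)]
print(matches)
```

Each extra annotation layer adds another slot to every token in the pattern, which is where the queries start to become hard to write and to maintain.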

With a true relational database approach, however, you can have pretty much as much annotation as you'd like on any word (or text), and there's essentially no decrease in speed.
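As a rough illustration of what such a relational layout might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and are not the schema of any actual corpus; the sketch only shows the idea of one row per token with one column per annotation layer, and it makes no attempt to demonstrate speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per token; each annotation layer is just another column.
conn.execute(
    """
    CREATE TABLE tokens (
        text_id INTEGER,
        seq     INTEGER,   -- position of the token within its text
        word    TEXT,
        lemma   TEXT,
        pos     TEXT,
        PRIMARY KEY (text_id, seq)
    )
    """
)
# Indexes on the annotation columns let lemma/POS lookups avoid full scans.
conn.execute("CREATE INDEX idx_tokens_lemma ON tokens (lemma)")
conn.execute("CREATE INDEX idx_tokens_pos ON tokens (pos)")

rows = [
    (1, 1, "she", "she", "PRP"), (1, 2, "could", "can", "MD"),
    (1, 3, "be", "be", "VB"), (1, 4, "late", "late", "JJ"),
    (2, 1, "he", "he", "PRP"), (2, 2, "might", "may", "MD"),
    (2, 3, "know", "know", "VB"),
    (3, 1, "I", "I", "PRP"), (3, 2, "can", "can", "MD"),
    (3, 3, "see", "see", "VB"),
]
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)

# Query: any form of the lemma "can" followed by any verb.
# The self-join pairs each token with the next token in the same text.
query = """
    SELECT a.word, b.word
    FROM tokens AS a
    JOIN tokens AS b
      ON b.text_id = a.text_id AND b.seq = a.seq + 1
    WHERE a.lemma = ? AND b.pos LIKE ?
    ORDER BY a.text_id, a.seq
"""
hits = conn.execute(query, ("can", "VB%")).fetchall()
print(hits)
```

Adding another annotation layer in this layout means adding a column (or a joined table) rather than changing the text format, so the query stays the same shape.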

Best,

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


