[Corpora-List] Corpus Development

Mon Apr 28 14:15:44 UTC 2008

Serge HEIDEN wrote:

> On Sunday, April 27, 2008 6:44 PM [GMT+1=CET],
> Mark Davies <Mark_Davies at byu.edu> wrote:
>> Most really large corpora that I'm aware of do use a relational
>> database architecture, including systems like IMS Corpus Workbench.
> 
> The IMS Corpus Workbench software's architecture is based on
> specific indexing techniques for processing and querying textual data.
> Those techniques were described in the book:
> "Managing Gigabytes: Compressing and Indexing Documents and Images"
> by Ian H. Witten, Alistair Moffat, Timothy C. Bell, 1999, Morgan Kaufmann.
> No RDBMS or similar architecture was used, and this can
> be seen from the source: http://cwb.sourceforge.net/
> 

This is also true of Xaira, of eXist, and many other XML-based systems.
They use specialised indexing and storage techniques optimised for
handling large quantities of text, rather than the specialised indexing
and storage techniques used by relational systems, which are optimised
for handling large numbers of, er, relations. It's true that you can
translate (with some loss of information) text into relations, but that
doesn't mean you *have* to do so to get your text efficiently processed.
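
To make the contrast concrete, here is a rough sketch in Python of a
positional inverted index, the general kind of text-oriented structure
described in "Managing Gigabytes". It is only an illustration, not how
CWB, Xaira, or eXist are actually implemented; the function names
(build_positional_index, find_phrase, to_relation) and the sample
sentence are made up for the example.

```python
from collections import defaultdict


def build_positional_index(tokens):
    """Map each token to the ascending list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return dict(index)


def find_phrase(index, phrase):
    """Return the start positions where the tokens of `phrase` occur consecutively."""
    if not phrase or any(token not in index for token in phrase):
        return []
    # Position sets for the second and later tokens of the phrase.
    later = [set(index[token]) for token in phrase[1:]]
    return [
        start
        for start in index[phrase[0]]
        if all(start + offset in positions
               for offset, positions in enumerate(later, start=1))
    ]


def to_relation(tokens):
    """The same text flattened into rows of a (position, token) relation."""
    return [(position, token) for position, token in enumerate(tokens)]


# Hypothetical sample text, chosen only for the illustration.
tokens = "the cat sat on the mat and the cat slept".split()
index = build_positional_index(tokens)

print(find_phrase(index, ["the", "cat"]))  # [0, 7]
print(to_relation(tokens)[:3])
```

The phrase lookup only touches the position lists of the words being
searched for. The (position, token) rows hold the same tokens, but in
this toy form they keep nothing beyond token order, and a phrase query
over them would be written as a self-join on position.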

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora