[Corpora-List] Corpus Development
Lou Burnard
lou.burnard at oucs.ox.ac.uk
Mon Apr 28 14:15:44 UTC 2008
Serge HEIDEN wrote:
> On Sunday, April 27, 2008 6:44 PM [GMT+1=CET],
> Mark Davies <Mark_Davies at byu.edu> wrote:
>> Most really large corpora that I'm aware of do use a relational
>> database architecture, including systems like IMS Corpus Workbench.
>
> The IMS Corpus Workbench software's architecture is based on
> specific indexing techniques for processing and querying textual data.
> Those techniques were described in the book:
> "Managing Gigabytes: Compressing and Indexing Documents and Images"
> by Ian H. Witten, Alistair Moffat, Timothy C. Bell, 1999, Morgan Kaufmann.
> No RDBMS or similar architecture was used, and this can
> be seen from the source: http://cwb.sourceforge.net/
>
This is also true of Xaira, of eXist, and many other XML-based systems.
They use specialised indexing and storage techniques optimised for
handling large quantities of text, rather than the specialised indexing
and storage techniques used by relational systems which are optimised
for handling large numbers of, er, relations. It's true that you can
translate (with some loss of information) text into relations, but that
doesn't mean you *have* to do so to get your text efficiently processed.
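
To make the contrast concrete, here is a minimal sketch of the kind of
text-oriented structure meant above: a positional inverted index with a
simple phrase lookup. It is only an illustration under stated assumptions,
not how CWB, Xaira or eXist are actually implemented; the names build_index,
phrase_search and DOCS are hypothetical, tokenisation is naive whitespace
splitting, and nothing is compressed or stored on disk.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> raw text.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "on the mat the cat slept",
}

def build_index(docs):
    """Map each token to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace tokenisation, no markup or annotation.
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted (doc_id, start_position) pairs where the phrase occurs."""
    tokens = phrase.lower().split()
    # A phrase containing an unseen token cannot match anywhere.
    if not tokens or any(t not in index for t in tokens):
        return []
    hits = []
    for doc_id, starts in index[tokens[0]].items():
        for start in starts:
            # Each following token must sit at the next consecutive position.
            if all(
                start + k in index[t].get(doc_id, ())
                for k, t in enumerate(tokens[1:], 1)
            ):
                hits.append((doc_id, start))
    return sorted(hits)

INDEX = build_index(DOCS)
```

Word order is kept directly as positions in the index, so a phrase query such
as phrase_search(INDEX, "the cat") only has to compare position lists. With
one row per token in a relational table, the same query would typically need
a self-join on document and position.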
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora