[Corpora-List] Corpora and SQL

Tue May 22 18:42:37 UTC 2007

Oliver Mason wrote:
> Is this not a performance nightmare?  A table with 200 million entries?

A challenge, but not necessarily a nightmare. MySQL has no problem 
handling 200 million rows, and tables can be compressed and stored in 
memory for increased performance. Collocate searching would have to be 
heavily optimised, though.
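
To make the collocate point concrete, here is a minimal sketch of the kind
of query involved. Everything in it is an assumption for illustration: the
one-row-per-token layout, the table name `tokens`, the helper names, and
the use of SQLite (via Python's sqlite3 module) as a stand-in for MySQL.
The self-join on token position is the part that gets expensive at 200
million rows, and it is what would need the heavy optimisation.

```python
import sqlite3


def build_corpus(tokens):
    """Load a token list into a hypothetical one-row-per-token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    # Index on the word form so node lookups do not scan the whole table.
    conn.execute("CREATE INDEX idx_word ON tokens (word)")
    conn.executemany(
        "INSERT INTO tokens (pos, word) VALUES (?, ?)", enumerate(tokens)
    )
    conn.commit()
    return conn


def collocates(conn, node, window=2):
    """Count words occurring within `window` tokens of each hit for `node`."""
    return conn.execute(
        """
        SELECT c.word, COUNT(*) AS freq
        FROM tokens AS n
        JOIN tokens AS c
          ON c.pos BETWEEN n.pos - ? AND n.pos + ?
         AND c.pos <> n.pos
        WHERE n.word = ?
        GROUP BY c.word
        ORDER BY freq DESC, c.word ASC
        """,
        (window, window, node),
    ).fetchall()


if __name__ == "__main__":
    text = "the cat sat on the mat and the cat slept".split()
    conn = build_corpus(text)
    for word, freq in collocates(conn, "cat", window=1):
        print(word, freq)
```

The join is driven by the indexed word lookup and then probes the primary
key for each neighbouring position, so its cost grows with the frequency
of the node word times the window size.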

> I would guess something specifically designed for textual data would
> be better (eg the system described in 'Managing Gigabytes' by
> Moffat/Witten/Bell).

Well, the Moffat/Witten/Bell system is not very well suited for 
linguistics, but CWB (which was originally written based on the 
Gigabytes book) is, and would in most cases have better performance than 
SQL.

As always it depends, but I would agree that CWB (or similar tools like 
Manatee) is in general the best solution for corpus linguistics.

cheers,
lars