[Corpora-List] Corpora and SQL

Lars Nygaard lars.nygaard at iln.uio.no
Tue May 22 18:42:37 UTC 2007


Oliver Mason wrote:
> Is this not a performance nightmare?  A table with 200 million entries?

A challenge, but not necessarily a nightmare. MySQL has no problem 
handling 200 million rows, and tables can be compressed and stored in 
memory for increased performance. Collocate searching would have to be 
heavily optimised, though.
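
To make the collocate point concrete, here is a minimal sketch of what such 
a query could look like with one row per token. It uses Python's built-in 
sqlite3 module as a stand-in for MySQL so that it runs as-is; the table 
name "token", the columns "pos" and "word", and both function names are my 
own assumptions for illustration, not anyone's actual schema.

```python
import sqlite3


def build_corpus(tokens):
    """Load tokens into a one-row-per-token table keyed by corpus position."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: pos is the token's offset in the corpus.
    conn.execute(
        "CREATE TABLE token (pos INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO token (pos, word) VALUES (?, ?)", enumerate(tokens)
    )
    # Index on word so that locating the node word does not scan the table.
    conn.execute("CREATE INDEX token_word ON token (word)")
    return conn


def collocates(conn, node, window=2):
    """Count words occurring within `window` positions of `node`."""
    return conn.execute(
        """
        SELECT c.word, COUNT(*) AS freq
        FROM token AS n
        JOIN token AS c
          ON c.pos BETWEEN n.pos - ? AND n.pos + ?
         AND c.pos <> n.pos
        WHERE n.word = ?
        GROUP BY c.word
        ORDER BY freq DESC, c.word
        """,
        (window, window, node),
    ).fetchall()


text = "the cat sat on the mat and the cat slept".split()
conn = build_corpus(text)
print(collocates(conn, "cat", window=1))
```

The self-join on a position range is the part that would need the heavy 
optimisation: for a frequent node word in a 200 million row table it 
touches a very large number of rows, even with both indexes in place.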

> I would guess something specifically designed for textual data would
> be better (eg the system described in 'Managing Gigabytes' by
> Moffat/Witten/Bell).

Well, the Moffat/Witten/Bell system is not very well suited to 
linguistics, but CWB (which was originally written based on the 
Gigabytes book) is, and would in most cases perform better than an 
SQL database.

As always, it depends, but I would agree that CWB (or a similar tool like 
Manatee) is in general the best solution for corpus linguistics.

cheers,
lars


