[Corpora-List] Corpora and SQL
Lars Nygaard
lars.nygaard at iln.uio.no
Tue May 22 18:55:57 UTC 2007
John D. Burger wrote:
>> Is this not a performance nightmare? A table with 200 million entries?
>
>
> Many databases are routinely used for far larger datasets.
>
> With respect to the original query, the industrial-strength DB Postgres
> has a well-developed extension for text search called tsearch2:
>
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
Can you use it to query columnar data, as in the example in the original
posting?
> One virtue of using real databases rather than text retrieval engines
> is the ability to query both document content and whatever metadata one
> might have associated with the text. "Find me blog entries with these
> words posted on Saturday evenings by authors whose profile says they
> were born before 1964 and are interested in sushi."
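
A rough sketch of what such a combined content-and-metadata query could look like, using Python's built-in sqlite3 module so it runs anywhere. The schema (authors, entries and their columns) and the sample rows are invented for illustration and do not come from this thread. The LIKE substring match is only a stand-in for a real full-text index such as tsearch2.

```python
import sqlite3

# Hypothetical schema and data: none of these names come from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, birth_year INTEGER, interests TEXT);
CREATE TABLE entries (id INTEGER PRIMARY KEY,
                      author_id INTEGER REFERENCES authors(id),
                      posted TEXT, body TEXT);
INSERT INTO authors VALUES (1, 1960, 'sushi, chess'), (2, 1980, 'sushi');
INSERT INTO entries VALUES
  (10, 1, '2007-05-19 20:30:00', 'notes on corpus query tools'),
  (11, 1, '2007-05-21 20:30:00', 'corpus query again'),
  (12, 2, '2007-05-19 21:00:00', 'corpus query from a younger author');
""")

def saturday_evening_entries(conn, word, born_before, interest):
    """Return ids of entries matching both content and metadata conditions."""
    sql = """
    SELECT e.id
    FROM entries e JOIN authors a ON a.id = e.author_id
    WHERE e.body LIKE '%' || ? || '%'                      -- content (substring only)
      AND strftime('%w', e.posted) = '6'                   -- Saturday (0 = Sunday)
      AND CAST(strftime('%H', e.posted) AS INTEGER) >= 18  -- 18:00 or later
      AND a.birth_year < ?                                 -- author metadata
      AND a.interests LIKE '%' || ? || '%'
    ORDER BY e.id
    """
    return [row[0] for row in conn.execute(sql, (word, born_before, interest))]

hits = saturday_evening_entries(conn, "corpus", 1964, "sushi")
print(hits)
```

Only the entry posted on a Saturday evening by the author born in 1960 satisfies every condition; the other two fail on weekday and birth year respectively.
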
Another possibility is to store the metadata in a SQL database and export,
on the fly, a subcorpus definition (start and stop corpus positions) for CWB.
The best of both worlds, so to speak. This works very well in the
Glossa corpus query system, which uses a combination of CWB and MySQL as
its backend.
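
A minimal sketch of the idea, not Glossa's actual code: the table layout, column names and sample rows below are assumptions made for illustration, and sqlite3 stands in for MySQL so the example is self-contained. Each text's metadata is stored alongside its start and stop corpus positions, a metadata query selects the texts, and the resulting spans are written out as tab-separated start/stop pairs. How that list is then handed to CWB is left open here.

```python
import sqlite3

# Hypothetical metadata table: one row per text, with its corpus positions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (id INTEGER PRIMARY KEY, genre TEXT, year INTEGER,
                    start_pos INTEGER, stop_pos INTEGER);
INSERT INTO texts VALUES
  (1, 'blog', 2006,    0, 1499),
  (2, 'news', 2006, 1500, 3999),
  (3, 'blog', 2007, 4000, 5200);
""")

def subcorpus_spans(conn, genre, year_from):
    """Select (start, stop) corpus positions of texts matching the metadata."""
    rows = conn.execute(
        "SELECT start_pos, stop_pos FROM texts "
        "WHERE genre = ? AND year >= ? ORDER BY start_pos",
        (genre, year_from),
    )
    return [(start, stop) for start, stop in rows]

def format_spans(spans):
    """Render spans as one tab-separated start/stop pair per line."""
    return "".join(f"{start}\t{stop}\n" for start, stop in spans)

spans = subcorpus_spans(conn, "blog", 2006)
dump = format_spans(spans)
print(dump, end="")
```

The point of the design is that the database never touches the token stream: it only answers "which regions?", and the corpus engine does the actual searching within them.
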
-lars