[Corpora-List] Corpora and SQL
John D. Burger
john at mitre.org
Wed May 23 13:07:11 UTC 2007
>> One virtue of using real databases rather than text retrieval
>> engines is the ability to query both document content and
>> whatever metadata one might have associated with the text. "Find
>> me blog entries with these words posted on Saturday evenings by
>> authors whose profile says they were born before 1964 and are
>> interested in sushi."
>
> Another possibility is to store metadata in a SQL database, and
> export, on the fly, a subcorpus definition (start and stop
> positions) for CWB. The best of both worlds, so to speak. This
> works very well for the Glossa corpus query system (which has a
> combination of CWB and MySQL as a backend).
The disadvantage of this is that no single engine can reason about
the best way to run your query. Like other databases, PostgreSQL
keeps various summary statistics about the distribution of values in
each column, and uses these to construct a (hopefully) optimal query
plan.
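As a rough sketch of the single-engine setup, here is one query that
touches both post content and profile metadata. SQLite (through
Python's sqlite3 module) stands in for PostgreSQL only so that the
example is self-contained; the tables, columns, rows, and helper
names are all made up for illustration. ANALYZE is the step that
gathers the statistics the planner consults.

```python
import sqlite3

# Hypothetical schema: one table of blogger profiles, one of posts.
SCHEMA = """
CREATE TABLE bloggers (
    id INTEGER PRIMARY KEY, birth_year INTEGER, interests TEXT);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY, blogger_id INTEGER REFERENCES bloggers(id),
    weekday TEXT, body TEXT);
CREATE INDEX posts_blogger ON posts(blogger_id);
CREATE INDEX bloggers_birth ON bloggers(birth_year);
"""

# One statement mixing content predicates and metadata predicates.
QUERY = """
SELECT p.id
FROM posts p JOIN bloggers b ON b.id = p.blogger_id
WHERE p.body LIKE '%' || ? || '%'
  AND p.weekday = 'Saturday'
  AND b.birth_year < 1964
  AND b.interests LIKE '%' || ? || '%'
ORDER BY p.id
"""

def build_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO bloggers VALUES (?, ?, ?)", [
        (1, 1960, "sushi, hiking"),
        (2, 1985, "sushi"),
        (3, 1955, "chess"),
    ])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
        (10, 1, "Saturday", "xyzzy is a magic word"),
        (11, 1, "Monday", "time to sleep"),
        (12, 2, "Saturday", "sleep is good"),
        (13, 3, "Saturday", "xyzzy again"),
    ])
    conn.commit()
    conn.execute("ANALYZE")  # collect the statistics the planner uses
    return conn

def find_posts(conn, word, interest):
    """Return ids of Saturday posts containing `word`, written by
    bloggers born before 1964 whose interests mention `interest`."""
    return [row[0] for row in conn.execute(QUERY, (word, interest))]

def show_plan(conn, word, interest):
    """Return the engine's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (word, interest))
    return [row[3] for row in rows]
```

Calling show_plan(build_db(), "xyzzy", "sushi") lists the steps the
engine picked; the exact wording varies between SQLite versions, and
PostgreSQL's EXPLAIN output looks different again.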
To continue my example, if we're interested in blog posts containing
the rare word "xyzzy", posted by bloggers whose profiles indicate
they like "food", one would want the DB to fetch the few matching
posts, using an index, then filter by looking for "food" in those
bloggers' profiles. On the other hand, if we're interested in posts
containing the more common word "sleep", posted by bloggers
interested in "19th-century Romantic poetry", it might be better to
fetch the few matching profiles, then filter those by checking
corresponding posts. Having all our data in a single resource allows
the engine to reason about these tradeoffs.
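The tradeoff above can be caricatured in a few lines. This is only a
toy, assuming the engine already has an estimated match count for
each predicate; real planners cost whole plans (I/O, join methods,
and so on), and the Predicate class, the plan function, and the
counts below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    table: str     # which table the predicate applies to
    term: str      # the word or phrase being searched for
    est_rows: int  # estimated matching rows, taken from statistics

def plan(post_pred, profile_pred):
    """Drive the query from the more selective predicate (fewest
    estimated rows), then filter the survivors with the other one."""
    driver, other = sorted([post_pred, profile_pred],
                           key=lambda p: p.est_rows)
    return (f"index scan {driver.table} for {driver.term!r}, "
            f"then filter {other.table} on {other.term!r}")

# Rare word in posts, common interest in profiles (made-up estimates).
rare_word = plan(Predicate("posts", "xyzzy", 12),
                 Predicate("profiles", "food", 50000))

# Common word in posts, rare interest in profiles (made-up estimates).
rare_interest = plan(
    Predicate("posts", "sleep", 80000),
    Predicate("profiles", "19th-century Romantic poetry", 7))
```

With these estimates the first plan starts from posts and the second
from profiles, which is the choice a split CWB-plus-SQL setup has no
single place to make.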
- John D. Burger
MITRE