[Corpora-List] Corpora and SQL

Wed May 23 13:07:11 UTC 2007

>> One virtue of using real databases rather than text retrieval  
>> engines  is the ability to query both document content and  
>> whatever metadata  one might have associated with the text.  "Find  
>> me blog entries with these words posted on Saturday evenings by  
>> authors whose profile says they were born before 1964 and are  
>> interested in sushi."
>
> Another possibility is to store metadata in a SQL database, and  
> export, on the fly, a subcorpus definition (start and stop  
> positions) for CWB. The best of both worlds, so to speak. This  
> works very well for the Glossa corpus query system (which has a  
> combination of CWB and MySQL as a backend).

The disadvantage of this is that no single engine can reason about  
the best way to run your query.  Like other databases, PostgreSQL  
keeps various summary statistics about the distribution of values in  
each indexed column, and uses these to construct a (hopefully)  
optimal query plan.

To continue my example, if we're interested in blog posts containing  
the rare word "xyzzy", posted by bloggers whose profiles indicate  
they like "food", one would want the DB to fetch the few matching  
posts, using an index, then filter by looking for "food" in those  
bloggers' profiles.  On the other hand, if we're interested in posts  
containing the more common word "sleep", posted by bloggers  
interested in "19th-century Romantic poetry", it might be better to  
fetch the few matching profiles, then filter those by checking  
corresponding posts.  Having all our data in a single resource allows  
the engine to reason about these tradeoffs.
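
A rough sketch of the single-engine setup, for illustration only.  It
uses Python's built-in sqlite3 module as a stand-in for PostgreSQL
(an assumption made so the example is self-contained), and the table
names, column names and toy data are all hypothetical.  Both queries
from the example above go through the same SQL statement, and the
engine is left to pick the join order from the statistics that
ANALYZE collects.

```python
import sqlite3

RARE_INTEREST = "19th-century Romantic poetry"

# One statement serves both cases; the planner decides which side to start from.
QUERY = """
    SELECT DISTINCT w.post_id
    FROM post_words AS w
    JOIN profiles AS p ON p.author_id = w.author_id
    WHERE w.word = ? AND p.interest = ?
    ORDER BY w.post_id
"""


def build_demo_db():
    """Build a toy in-memory corpus: 200 bloggers with 5 posts each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE profiles (author_id INTEGER, interest TEXT);
        CREATE TABLE post_words (post_id INTEGER, author_id INTEGER, word TEXT);
        CREATE INDEX profiles_interest ON profiles (interest);
        CREATE INDEX profiles_author ON profiles (author_id);
        CREATE INDEX post_words_word ON post_words (word);
        CREATE INDEX post_words_author ON post_words (author_id);
        """
    )
    for author in range(1, 201):
        # Common interest: every blogger likes "food".
        conn.execute("INSERT INTO profiles VALUES (?, 'food')", (author,))
        for k in range(1, 6):
            post = (author - 1) * 5 + k
            # Common word: every post contains "sleep".
            conn.execute(
                "INSERT INTO post_words VALUES (?, ?, 'sleep')", (post, author)
            )
    # Rare interest: only blogger 7.  Rare word: only post 3 (by blogger 1).
    conn.execute("INSERT INTO profiles VALUES (7, ?)", (RARE_INTEREST,))
    conn.execute("INSERT INTO post_words VALUES (3, 1, 'xyzzy')")
    conn.commit()
    # Collect the per-index statistics the planner consults.
    conn.execute("ANALYZE")
    return conn


def find_posts(conn, word, interest):
    """Return ids of posts containing `word` by bloggers with `interest`."""
    return [row[0] for row in conn.execute(QUERY, (word, interest))]


def explain_plan(conn, word, interest):
    """Return the engine's plan steps as human-readable strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (word, interest))
    return [row[3] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    for word, interest in [("xyzzy", "food"), ("sleep", RARE_INTEREST)]:
        print(word, "/", interest, "->", find_posts(conn, word, interest))
        for step in explain_plan(conn, word, interest):
            print("   ", step)
```

The plan text printed will vary with the SQLite version, and a
PostgreSQL plan would look different again, so no particular output
is claimed here.  The point is only that, with posts and profiles in
one database, the choice of which index to start from is the
engine's to make rather than the application's.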

- John D. Burger
   MITRE