[Corpora-List] Corpus Development

Tue Apr 29 08:49:39 UTC 2008

Marco,

I answer below.

Marco Baroni wrote:
> Very interesting thread...
>
> I've been tempted by relational db's for a while, but I keep running
> into the following problem (that perhaps is just due to the fact that
> I don't understand relational db's well enough):
>
>> With a true relational database approach, however, you can have
>> pretty much as much annotation as you'd like on any word (or text),
>> and there's essentially no decrease in speed.
>
> ... however, can you define "flexible" queries over sequences of
> annotations?
>
> This is what I mean. With a special-purpose tool like CWB, I can
> easily (although not super-efficiently) define regexps on _sequences
> of POS tokens_. It seems to me that this is pretty fundamental to work
> with natural language. Suppose, e.g., that I want to study verb-noun
> collocations in a certain corpus. I want to be able to mine them with
> something like:
>
> VERB ADV? DET? ADJ* NOUN
>
> I could of course factor this into multiple queries (VERB NOUN, VERB
> ADJ NOUN, VERB ADJ ADJ NOUN, VERB DET NOUN, etc.), but this would get
> very tedious very fast.
>
> Is there any hope to emulate the CWB "regexp at the positional
> annotation level" feature in a relational db?
>
> Thanks.
>
> Regards,
>
> Marco
>   

Yes, this can be emulated in a relational DB.  In fact, I have 
implemented a corpus query system which can search for exactly what you 
suggest, using a relational DB underneath.  My corpus query system is 
called "Emdros", and can be found here:

http://emdros.org

Your specific query would run something like this:

SELECT ALL OBJECTS
WHERE
[Token pos="VERB"]
[Token pos="ADV"]*{0,1}
[Token pos="DET"]*{0,1}
[Token pos="ADJ"]*
[Token pos="NOUN"]

As I said, this is done with a relational DB as a backend.
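
For readers who want a feel for the general idea outside of any
particular tool, here is a rough Python sketch of one simple way to run a
"regexp over POS sequences" on top of a relational table. It makes no
claim about how Emdros works internally. The table name "token", its
columns, the toy sentence, and the one-character tag codes are all
assumptions made up for the illustration.

```python
import re
import sqlite3

# One-character codes for the tags used in the pattern (an arbitrary choice).
TAG_CODE = {"VERB": "V", "ADV": "R", "DET": "D", "ADJ": "J", "NOUN": "N"}

# VERB ADV? DET? ADJ* NOUN, rewritten over the one-character codes.
PATTERN = re.compile(r"VR?D?J*N")


def build_demo_db():
    """Create an in-memory table with one row per token (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token (position INTEGER PRIMARY KEY, word TEXT, pos TEXT)"
    )
    words = "She reads the old dusty books and writes letters".split()
    tags = "PRON VERB DET ADJ ADJ NOUN CONJ VERB NOUN".split()
    conn.executemany(
        "INSERT INTO token (position, word, pos) VALUES (?, ?, ?)",
        [(i, w, t) for i, (w, t) in enumerate(zip(words, tags), start=1)],
    )
    return conn


def find_verb_noun_spans(conn):
    """Return the word sequences whose tags match PATTERN."""
    rows = conn.execute(
        "SELECT word, pos FROM token ORDER BY position"
    ).fetchall()
    # Tags outside the pattern map to "x", which the pattern never matches.
    encoded = "".join(TAG_CODE.get(pos, "x") for _, pos in rows)
    # One character per token, so match offsets are token offsets.
    return [
        [word for word, _ in rows[m.start():m.end()]]
        for m in PATTERN.finditer(encoded)
    ]


if __name__ == "__main__":
    for span in find_verb_noun_spans(build_demo_db()):
        print(" ".join(span))
```

On the toy sentence this yields "reads the old dusty books" and "writes
letters". Note that finditer() returns leftmost, non-overlapping, greedy
matches only, and that the sketch pulls the whole tag sequence into
memory, so a real corpus would need to be processed sentence by sentence
or in chunks.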

The real strength of Emdros, however, lies in efficient querying of 
levels higher than the word, i.e., syntax and/or discourse.

For example:

SELECT ALL OBJECTS
WHERE
[Clause
    [Phrase phrase_type="NP" and phrase_function="Subj"]
    ..
    [Phrase phrase_type="VP" and phrase_function="Pred"
          [Token pos="VERB" and lemma="go"]
    ]
]

This finds every Clause that contains two Phrases in this order: an NP 
functioning as Subject, then, after an arbitrary gap within the Clause 
(signified by ".."), a Phrase whose phrase type is "VP" and whose 
function is "Pred". Somewhere within that second Phrase there must be a 
Token whose part of speech is "VERB" and whose lemma is "go".

All of this assumes an appropriately tagged corpus, of course.
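
To make the containment and ordering conditions of that query concrete,
here is a small Python sketch over toy data. It only spells out the
semantics described above and says nothing about how Emdros evaluates
the query. The Obj class, the helper names, and the simplification that
every object covers one contiguous range of token positions are
assumptions made for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    kind: str            # "Clause", "Phrase" or "Token"
    first: int           # first token position covered
    last: int            # last token position covered
    feats: dict = field(default_factory=dict)


def inside(inner, outer):
    """True if inner lies entirely within outer."""
    return outer.first <= inner.first and inner.last <= outer.last


def has(obj, kind, **feats):
    """True if obj has the given kind and all the given feature values."""
    return obj.kind == kind and all(obj.feats.get(k) == v for k, v in feats.items())


def matching_clauses(objects):
    """Clauses with an NP/Subj phrase followed by a VP/Pred phrase containing 'go'."""
    hits = []
    for clause in (o for o in objects if o.kind == "Clause"):
        parts = [o for o in objects if o is not clause and inside(o, clause)]
        subjects = [
            p for p in parts
            if has(p, "Phrase", phrase_type="NP", phrase_function="Subj")
        ]
        preds = [
            p for p in parts
            if has(p, "Phrase", phrase_type="VP", phrase_function="Pred")
            and any(
                has(t, "Token", pos="VERB", lemma="go") and inside(t, p)
                for t in parts
            )
        ]
        # The subject must end before the predicate starts.
        if any(s.last < v.first for s in subjects for v in preds):
            hits.append(clause)
    return hits


# Toy annotation of "The children went home" (1-4) and "Birds sing" (5-6).
DEMO = [
    Obj("Clause", 1, 4),
    Obj("Phrase", 1, 2, {"phrase_type": "NP", "phrase_function": "Subj"}),
    Obj("Phrase", 3, 3, {"phrase_type": "VP", "phrase_function": "Pred"}),
    Obj("Token", 3, 3, {"pos": "VERB", "lemma": "go"}),
    Obj("Clause", 5, 6),
    Obj("Phrase", 5, 5, {"phrase_type": "NP", "phrase_function": "Subj"}),
    Obj("Phrase", 6, 6, {"phrase_type": "VP", "phrase_function": "Pred"}),
    Obj("Token", 6, 6, {"pos": "VERB", "lemma": "sing"}),
]

if __name__ == "__main__":
    for clause in matching_clauses(DEMO):
        print(clause.first, clause.last)
```

Only the first toy clause qualifies, since the verb in the second one
has the lemma "sing". The nested loops are quadratic and meant purely to
show the logic; a real system would rely on indexes rather than scanning
every object for every clause.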

For more information, see the tutorial on the query language:

http://emdros.org/MQL-Tutorial.pdf

There are also several papers on Emdros:

http://www.hum.aau.dk/~ulrikp/pdf/petersen-emdros-COLING-2004.pdf
http://www.hum.aau.dk/~ulrikp/pdf/LREC2006.pdf
http://www.hum.aau.dk/~ulrikp/pdf/Petersen-FSMNLP2005.pdf

Emdros is:

- 9 years old (i.e., mature)
- Open Source
- Very well documented (more than 300 pages of documentation)

Feel free to contact me off-list if you have any questions concerning 
Emdros.

Ulrik Sandborg-Petersen
--
Ulrik Sandborg-Petersen, PhD candidate
Aalborg University, Denmark
http://ulrikp.org -- Home page
