[Corpora-List] Corpus Development

Marco Baroni marco.baroni at unitn.it
Tue Apr 29 07:42:32 UTC 2008


Very interesting thread...

I've been tempted by relational databases for a while, but I keep running
into the following problem (which is perhaps just due to the fact that
I don't understand relational databases well enough):

> With a true relational database approach, however, you can have pretty
> much as much annotation as you'd like on any word (or text), and there's
> essentially no decrease in speed.

... however, can you define "flexible" queries over sequences of
annotations?

This is what I mean. With a special-purpose tool like CWB, I can
easily (although not super-efficiently) define regexps on _sequences
of POS tokens_. It seems to me that this is pretty fundamental to working
with natural language. Suppose, e.g., that I want to study verb-noun
collocations in a certain corpus. I want to be able to mine them with
something like:

VERB ADV? DET? ADJ* NOUN

I could of course factor this into multiple queries (VERB NOUN, VERB
ADJ NOUN, VERB ADJ ADJ NOUN, VERB DET NOUN, etc.), but this would get
very tedious very fast.
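
To make the kind of query concrete, here is a minimal Python sketch of a
regexp over a sequence of POS tags. It is only an illustration of the idea,
not CWB's query syntax or implementation, and not a relational solution.
The toy sentence, the tagset, and the helper names (compile_pos_pattern,
find_sequences) are all assumptions made up for the example. It also assumes
tags contain no spaces, and it returns non-overlapping matches only.

```python
import re

# Hypothetical toy sentence as (word, POS) pairs; the tagset is illustrative.
tokens = [
    ("She", "PRON"), ("ate", "VERB"), ("quickly", "ADV"), ("the", "DET"),
    ("big", "ADJ"), ("red", "ADJ"), ("apple", "NOUN"), (".", "PUNCT"),
]


def compile_pos_pattern(pattern):
    """Compile e.g. 'VERB ADV? DET? ADJ* NOUN' into a regex over a tag string.

    Each tag in the tag string is followed by one space, so every pattern
    item becomes a group '(?:TAG )' carrying the item's quantifier.
    """
    parts = []
    for item in pattern.split():
        m = re.fullmatch(r"(\w+)([?*+]?)", item)
        if m is None:
            raise ValueError(f"unsupported pattern item: {item!r}")
        tag, quantifier = m.groups()
        parts.append(f"(?:{re.escape(tag)} ){quantifier}")
    # The lookbehind forces a match to begin at a tag boundary.
    return re.compile(r"(?<![^ ])" + "".join(parts))


def find_sequences(tokens, pattern):
    """Return the word sequences whose POS tags match the pattern."""
    regex = compile_pos_pattern(pattern)
    tag_string = "".join(tag + " " for _, tag in tokens)
    results = []
    for m in regex.finditer(tag_string):
        # The number of spaces before an offset equals the number of
        # tokens before it, which maps character offsets to token indices.
        start = tag_string.count(" ", 0, m.start())
        end = tag_string.count(" ", 0, m.end())
        results.append([word for word, _ in tokens[start:end]])
    return results


print(find_sequences(tokens, "VERB ADV? DET? ADJ* NOUN"))
# prints [['ate', 'quickly', 'the', 'big', 'red', 'apple']]
```

The single pattern covers all the variants (VERB NOUN, VERB DET NOUN, VERB
ADJ ADJ NOUN, and so on) that would otherwise need separate queries.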

Is there any hope of emulating CWB's "regexp at the positional
annotation level" feature in a relational db?

Thanks.

Regards,

Marco

PS
I have a feeling that I posted the same question on some list not so
long ago -- hope at least it's not the same list ;-)

