[Corpora-List] Corpus Development
Ulrik Sandborg-Petersen
ulrikp at hum.aau.dk
Tue Apr 29 08:49:39 UTC 2008
Marco,
I answer below.
Marco Baroni wrote:
> Very interesting thread...
>
> I've been tempted by relational db's for a while, but I keep running
> into the following problem (that perhaps is just due to the fact that
> I don't understand relational db's well enough):
>
>
>> With a true relational database approach, however, you can have
>> pretty much as much annotation as you'd like on any word (or text),
>> and there's essentially no decrease in speed.
>
> ... however, can you define "flexible" queries over sequences of
> annotations?
>
> This is what I mean. With a special-purpose tool like CWB, I can
> easily (although not super-efficiently) define regexps on _sequences
> of POS tokens_. It seems to me that this is pretty fundamental to work
> with natural language. Suppose, e.g., that I want to study verb-noun
> collocations in a certain corpus. I want to be able to mine them with
> something like:
>
> VERB ADV? DET? ADJ* NOUN
>
> I could of course factor this into multiple queries (VERB NOUN, VERB
> ADJ NOUN, VERB ADJ ADJ NOUN, VERB DET NOUN, etc.), but this would get
> very tedious very fast.
>
> Is there any hope to emulate the CWB "regexp at the positional
> annotation level" feature in a relational db?
>
> Thanks.
>
> Regards,
>
> Marco
>
Yes, this can be emulated in a relational DB. In fact, I have
implemented a corpus query system which can search for exactly what you
suggest, using a relational DB underneath. My corpus query system is
called "Emdros", and can be found here:
http://emdros.org
Your specific query would run something like this:
    SELECT ALL OBJECTS
    WHERE
    [Token pos="VERB"]
    [Token pos="ADV"]*{0,1}
    [Token pos="DET"]*{0,1}
    [Token pos="ADJ"]*
    [Token pos="NOUN"]
As I said, this is done with a relational DB as a backend.
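To give a feel for how a pattern like VERB ADV? DET? ADJ* NOUN can be evaluated on top of plain relational tables, here is a minimal sketch in Python with SQLite. It is an illustration only, not a description of how Emdros works internally: the table names (token, trans), the column names, and the hand-built state numbers are all assumptions made up for this example. The pattern is written out as a small automaton in a transition table, and a recursive query walks it over adjacent tokens.

```python
import sqlite3

# Hypothetical automaton for VERB ADV? DET? ADJ* NOUN.
# State 0 is the start state; state 5 is the accepting state.
TRANSITIONS = [
    (0, "VERB", 1),
    (1, "ADV", 2), (1, "DET", 3), (1, "ADJ", 4), (1, "NOUN", 5),
    (2, "DET", 3), (2, "ADJ", 4), (2, "NOUN", 5),
    (3, "ADJ", 4), (3, "NOUN", 5),
    (4, "ADJ", 4), (4, "NOUN", 5),
]
ACCEPT = 5

# Each row of "run" is a partial match: where it started, where it
# currently ends, and which automaton state it has reached.
QUERY = """
WITH RECURSIVE run(start_pos, end_pos, state) AS (
    SELECT t.position, t.position, tr.dst
    FROM token AS t
    JOIN trans AS tr ON tr.src = 0 AND tr.pos = t.pos
  UNION
    SELECT r.start_pos, t.position, tr.dst
    FROM run AS r
    JOIN token AS t ON t.position = r.end_pos + 1
    JOIN trans AS tr ON tr.src = r.state AND tr.pos = t.pos
)
SELECT start_pos, end_pos FROM run WHERE state = ?
ORDER BY start_pos, end_pos
"""


def find_matches(tags):
    """Return (start, end) token positions (1-based) of every match."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE token (position INTEGER PRIMARY KEY, pos TEXT)")
    conn.execute("CREATE TABLE trans (src INTEGER, pos TEXT, dst INTEGER)")
    conn.executemany("INSERT INTO token VALUES (?, ?)", enumerate(tags, start=1))
    conn.executemany("INSERT INTO trans VALUES (?, ?, ?)", TRANSITIONS)
    rows = conn.execute(QUERY, (ACCEPT,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = ["DET", "NOUN", "VERB", "DET", "ADJ", "ADJ", "NOUN",
              "VERB", "NOUN", "ADV"]
    print(find_matches(sample))
```

The point of the sketch is that nothing has to be factored into separate VERB NOUN, VERB ADJ NOUN, ... queries by hand: the optional and repeated elements live in the transition table, and one query covers all of them.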
The real strength of Emdros, however, lies in efficient querying of
levels higher than the word, i.e., syntax and/or discourse.
For example:
    SELECT ALL OBJECTS
    WHERE
    [Clause
       [Phrase phrase_type="NP" and phrase_function="Subj"]
       ..
       [Phrase phrase_type="VP" and phrase_function="Pred"
          [Token pos="VERB" and lemma="go"]
       ]
    ]
This would find all clauses containing two phrases: an NP which is also
the Subject, followed by an arbitrary gap within the Clause (signified
by ".."), followed by a Phrase whose phrase type is "VP" and whose
function is "Pred" (Predicate). Somewhere within that VP, there must be
a Token whose part-of-speech is "VERB" and whose lemma is "go".
All of this assumes an appropriately tagged corpus, of course.
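To make the meaning of the nested query concrete, here is a small Python sketch of the same containment and ordering logic. It assumes, purely for illustration, that every annotated object (Clause, Phrase, Token) is stored as a span of token positions with a dictionary of features; the Obj class and the helper names are hypothetical and are not part of Emdros.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """Hypothetical annotated object covering token positions first..last."""
    kind: str
    first: int
    last: int
    feats: dict = field(default_factory=dict)


def inside(inner, outer):
    """True if inner's span lies entirely within outer's span."""
    return outer.first <= inner.first and inner.last <= outer.last


def has(obj, **wanted):
    """True if obj carries every requested feature value."""
    return all(obj.feats.get(k) == v for k, v in wanted.items())


def find_clauses(objects):
    """Clauses with an NP subject followed, anywhere later in the clause,
    by a VP predicate that contains a VERB token with lemma 'go'."""
    phrases = [o for o in objects if o.kind == "Phrase"]
    tokens = [o for o in objects if o.kind == "Token"]
    hits = []
    for clause in (o for o in objects if o.kind == "Clause"):
        inner = [p for p in phrases if inside(p, clause)]
        subjects = [p for p in inner
                    if has(p, phrase_type="NP", phrase_function="Subj")]
        predicates = [
            p for p in inner
            if has(p, phrase_type="VP", phrase_function="Pred")
            and any(inside(t, p) and has(t, pos="VERB", lemma="go")
                    for t in tokens)
        ]
        # ".." in the query: the VP starts somewhere after the NP ends.
        if any(s.last < p.first for s in subjects for p in predicates):
            hits.append(clause)
    return hits


# Toy corpus: "The boy went home" (positions 1-4), "She ate apples" (5-7).
CORPUS = [
    Obj("Clause", 1, 4),
    Obj("Phrase", 1, 2, {"phrase_type": "NP", "phrase_function": "Subj"}),
    Obj("Phrase", 3, 4, {"phrase_type": "VP", "phrase_function": "Pred"}),
    Obj("Token", 3, 3, {"pos": "VERB", "lemma": "go"}),
    Obj("Clause", 5, 7),
    Obj("Phrase", 5, 5, {"phrase_type": "NP", "phrase_function": "Subj"}),
    Obj("Phrase", 6, 7, {"phrase_type": "VP", "phrase_function": "Pred"}),
    Obj("Token", 6, 6, {"pos": "VERB", "lemma": "eat"}),
]

if __name__ == "__main__":
    for clause in find_clauses(CORPUS):
        print(clause.first, clause.last)
```

Only the first toy clause qualifies, since the second has the right phrase structure but a verb with a different lemma.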
For more information, see the tutorial on the query language:
http://emdros.org/MQL-Tutorial.pdf
There are also several papers on Emdros:
http://www.hum.aau.dk/~ulrikp/pdf/petersen-emdros-COLING-2004.pdf
http://www.hum.aau.dk/~ulrikp/pdf/LREC2006.pdf
http://www.hum.aau.dk/~ulrikp/pdf/Petersen-FSMNLP2005.pdf
Emdros is:
- 9 years old (i.e., mature)
- Open Source
- Very well documented (more than 300 pages of documentation)
Feel free to contact me off-list if you have any questions concerning
Emdros.
Ulrik Sandborg-Petersen
--
Ulrik Sandborg-Petersen, PhD candidate
Aalborg University, Denmark
http://ulrikp.org -- Home page