[Corpora-List] querying corpora

Sat Mar 1 09:42:00 UTC 2008

 Hi Sebastian,
~
 here are my impressions after a first read:
~
// __
~
 About this paper:
~
 http://www.coli.uni-saarland.de/~pado/pub/papers/ijcnlp08_burchardt.pdf
~
 I will have to read it with more time. Honestly, I am put off by the OWL approach.
~
 I may be flatly wrong, but I don't share the vision that is being hyped
by OWL-based research, even less when applied to corpora; among many
other reasons, because ontologies are very questionable artifacts
exploitable through first-order logic.
~
 I think corpora should use agreed-on POS markers as their addressable
units, in the same way DBMSs use "columns", but they should also let
their users get down to the most minimal, not obviously grammatical
layer.
~
 For example: how many times has the letter "e" been used in all the texts
comprising the corpus, not including cases of "the" in which the "e"
"sounds like" "the"?
~
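 A minimal sketch of that kind of sub-grammatical query, in Python. I am
assuming the simplest reading, namely that every whole-word "the" is
skipped; a truly phonetic exclusion would need a pronunciation lexicon,
which is not shown. The name count_letter and the toy texts are made up
for illustration.

```python
import re

def count_letter(texts, letter="e", skip_words=("the",)):
    """Count `letter` over all texts, ignoring tokens listed in skip_words."""
    total = 0
    for text in texts:
        # Crude tokenization: runs of ASCII letters, lower-cased.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in skip_words:
                continue  # assumption: skip the whole word "the"
            total += token.count(letter)
    return total

# Toy corpus (hypothetical data).
toy_texts = ["The end of the tree.", "Here we see the letter e."]
print(count_letter(toy_texts))
```

 Passing skip_words=() gives the plain letter count, so the same routine
answers both versions of the question.
~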
// __
 This paper:
~
 http://www.alta.asn.au/events/altw2004/publication/04-22.pdf
~
 is a very interesting one!
~
> * Firstly, updates are not supported as query languages focus on the needs of linguists searching for syntactic constructions.
~
 A corpus is not supposed to be edited, but updates to personalized
"deltas" that refer to it may be needed, and implementing them is not that
hard. In fact they could be understood as some kind of private
annotations that override certain segments of the corpus. Say
lawyers, scientists or business people collect all the text relating to
some matter. Someone else should be able to readily reuse their corpora,
by priming their own texts in a structured way with these corpora, so
that as they write they can read what other people have said about it
and can choose to dissent from, question, expand on or disregard some
points.
~
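 To make the "delta" idea concrete, here is a rough Python sketch. It
assumes the corpus is just an immutable sequence of segments and that a
private annotation replaces one segment by index; Corpus, Overlay,
annotate and view are names I made up, not anything from the paper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Corpus:
    """Read-only base corpus: a tuple of segments (e.g. sentences)."""
    segments: tuple

@dataclass
class Overlay:
    """Private delta: segment index -> text that overrides it for one user."""
    base: Corpus
    overrides: dict = field(default_factory=dict)

    def annotate(self, index, text):
        # Only existing segments can be overridden; the base is never touched.
        if not 0 <= index < len(self.base.segments):
            raise IndexError(index)
        self.overrides[index] = text

    def view(self):
        # The user's view: base segments with private overrides applied.
        return [self.overrides.get(i, seg)
                for i, seg in enumerate(self.base.segments)]

shared = Corpus(("Clause one.", "Clause two.", "Clause three."))
mine = Overlay(shared)
mine.annotate(1, "Clause two. [I disagree with this reading]")
print(mine.view())
```

 Each user keeps an Overlay of their own, so the shared corpus stays
unedited and the deltas can be dropped or exchanged independently.
~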
> * Secondly, their relationship to existing database query languages is poorly understood, making it difficult to apply standard database indexing and query optimization techniques. As a consequence they do not scale well.
~
 Yeah! Even though DBMSs are a well-tested technology we can
definitely learn from and use, well-structured data and texts don't
make a good match. See below about "scaling well".
~
> * Finally, linguistic annotations have both a sequential and a hierarchical organization. Query languages must support queries that refer to both of these types of structure simultaneously. Such hybrid queries should have a concise syntax. The interplay between these factors has resulted in a variety of mutually-inconsistent approaches.
~
 Well, why is it so hard to come up with an idea of what this "concise
syntax" should look like?
~
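 Just to illustrate what a query on both kinds of structure at once
involves, here is a small Python sketch. It is not the syntax of any of
the languages surveyed in the paper; the tuple encoding of trees and the
names index_tree and child_then_next are assumptions of mine. Each node
gets a parent link (hierarchy) and a leaf span (sequence), and the query
uses both.

```python
def index_tree(tree):
    """Flatten a (label, child, ...) tuple tree into node records
    holding the parent id and the [start, end) span over the leaves."""
    nodes = []

    def walk(node, parent, pos):
        if isinstance(node, str):      # a leaf (word) occupies one position
            return pos + 1
        label, *children = node
        me = len(nodes)
        nodes.append({"label": label, "parent": parent,
                      "start": pos, "end": None})
        for child in children:
            pos = walk(child, me, pos)
        nodes[me]["end"] = pos
        return pos

    walk(tree, None, 0)
    return nodes

def child_then_next(nodes, parent_label, first_label, next_label):
    """Hierarchical + sequential: a `first_label` child of a `parent_label`
    node, immediately followed by a `next_label` node inside that parent."""
    hits = []
    for a in nodes:
        if a["label"] != first_label or a["parent"] is None:
            continue
        p = nodes[a["parent"]]
        if p["label"] != parent_label:
            continue
        for b in nodes:
            if (b["label"] == next_label and b["start"] == a["end"]
                    and b["end"] <= p["end"]):
                hits.append((a["start"], b["start"], b["end"]))
    return hits

tree = ("S", ("NP", ("PRP", "I")),
             ("VP", ("V", "saw"), ("NP", ("DT", "the"), ("N", "man"))))
print(child_then_next(index_tree(tree), "VP", "V", "NP"))
```

 A concise syntax would presumably let one write those three conditions
(parent, child, immediate follower) in a single short expression.
~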
> 2.6 LPath
~
 Very interesting, since it is based on XPath, and texts written in
alphabetic natural languages are naturally representable as syntax
trees. How exhaustive is LPath?
~
> Query trees are generally very small (if spread widely) so massive trees decrease filter effectiveness during query processing and have a bad effect on matching algorithms.
~
 Well, I think the whole idea of building a corpus is to readily index it
using structures, so that the responses to most (if not all) queries are
instant.
~
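 As a rough illustration of indexing up front, here is a Python sketch of
a plain inverted index over words and POS tags. It assumes the corpus is
already tokenized into (word, tag) pairs; build_index and the toy
sentences are hypothetical, and a real corpus would need something far
more compact.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each word and each POS tag to its (sentence, token) positions."""
    by_word, by_pos = defaultdict(list), defaultdict(list)
    for s, sentence in enumerate(sentences):
        for t, (word, pos) in enumerate(sentence):
            by_word[word.lower()].append((s, t))
            by_pos[pos].append((s, t))
    return by_word, by_pos

# Toy tagged corpus (hypothetical data).
tagged = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("A", "DT"), ("dog", "NN")],
]
by_word, by_pos = build_index(tagged)
print(by_word["dog"])   # positions of "dog"
print(by_pos["DT"])     # positions of determiners
```

 The cost is paid once at build time; afterwards a lookup by word or by
POS tag is a dictionary access rather than a scan of the texts.
~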
> Thus subtree movement should appear as a basic operation. (Cotton and Bird, 2002) present a tree edit operations all in terms of node movement of a distinguished node.
~
 Yes!
~
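 For what it is worth, subtree movement as a single basic operation could
look roughly like the Python sketch below. This is only my own
illustration, not the formulation of Cotton and Bird; Node and
move_subtree are made-up names.

```python
class Node:
    """Mutable tree node that keeps a link to its parent."""
    def __init__(self, label, children=None):
        self.label = label
        self.parent = None
        self.children = []
        for child in children or []:
            self.attach(child)

    def attach(self, child, index=None):
        child.parent = self
        if index is None:
            self.children.append(child)
        else:
            self.children.insert(index, child)

    def bracket(self):
        if not self.children:
            return self.label
        inner = " ".join(c.bracket() for c in self.children)
        return "(" + self.label + " " + inner + ")"

def move_subtree(node, new_parent, index=None):
    """Detach `node` (with everything below it) and re-attach it under
    `new_parent`; moving a node into its own subtree is refused."""
    ancestor = new_parent
    while ancestor is not None:
        if ancestor is node:
            raise ValueError("cannot move a node under itself")
        ancestor = ancestor.parent
    if node.parent is not None:
        node.parent.children.remove(node)
    new_parent.attach(node, index)

np = Node("NP", [Node("what")])
vp = Node("VP", [Node("V"), np])
s = Node("S", [Node("NP2"), vp])
move_subtree(np, s, 0)   # front the NP, as in a movement analysis
print(s.bracket())
```
~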
 Thank you very much
 lbrtchx

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora