[Corpora-List] Parsed corpus file format

Linas Vepstas linasvepstas at gmail.com
Thu Jul 3 18:09:46 UTC 2008


2008/7/3  <maxwell at umiacs.umd.edu>:
>
> Also, I'd be interested in hearing more about the problems Linas Vepstas
> had with using XML to represent trees, and why "triples" or sexprs seem
> better.  I would have thought that XML would be ideal (apart maybe from a
> bit of verboseness), and that triples would be procrustean.

Superficially, the observation isn't deep, or wouldn't seem to be, but the
proposed alternatives seem a lot worse.  Let's start with a tree
represented as an s-expression.
Example:

  (S (NP the garage) (VP is (PP next to (NP the house))) .)

There's an "obvious" encoding in XML, which no one actually ever uses:

  <S><NP>the garage</NP><VP>is<PP>next to<NP>the house</NP></PP></VP>.</S>

This is slightly less readable, utterly equivalent, and trivial to convert
to and from.  Not a big deal, I would think.  Let's look at one proposed
format: eGXL. The encoding would look something like

<graph id="Sentence">
  <graph id="g8">
    <node id="s8_1" form="the"/>
    <node id="s8_2" form="garage"/>
    <node id="s8_3" form="is"/>
   (... etc....)
   <node id="s8_6" form="the"/>
   <node id="s8_7" form="house"/>

   <node id="p8_1" form="S"/>
   <node id="p8_2" form="NP"/>  <!-- will be (NP the garage) -->
   <node id="p8_3" form="VP"/>
   <node id="p8_4" form="PP"/>
   <node id="p8_5" form="NP"/>  <!-- a different NP than the earlier one -->

   <edge from="p8_5" to="s8_6"/>   <!-- "the" is part of NP -->
   <edge from="p8_5" to="s8_7"/>  <!-- "house" is part of NP -->
   <edge from="p8_4" to="p8_5"/>  <!-- (NP the house) is a part of a PP -->

   ... etc...

The above is how you represent a tree in eGXL.  Yikes!! Extremely
verbose, and it's utterly opaque; it needs some automated wysiwyg
tool to view the thing.  Debugging becomes difficult and tedious,
instead of something that can be done at a glance.
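
To put a rough number on the verbosity, here is a minimal Python sketch
that flattens the s-expression into a node/edge listing of the same general
shape as the one above. The ids (s1, p1, ...) and the helper names are made
up for illustration; this is not the actual eGXL schema, just the same idea.

```python
import re

def parse_sexpr(text):
    """Parse '(S (NP the garage) ...)' into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok                 # a plain word (leaf)
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                       # skip the closing ")"
        return node

    return read()

def flatten(tree):
    """Return (nodes, edges); ids are hypothetical: s<n> = word, p<n> = phrase."""
    nodes, edges = [], []
    counters = {"s": 0, "p": 0}

    def new_id(kind):
        counters[kind] += 1
        return f"{kind}{counters[kind]}"

    def walk(item):
        if isinstance(item, str):      # word node
            node_id = new_id("s")
            nodes.append((node_id, item))
            return node_id
        node_id = new_id("p")          # phrase node; item[0] is its label
        nodes.append((node_id, item[0]))
        for child in item[1:]:
            edges.append((node_id, walk(child)))
        return node_id

    walk(tree)
    return nodes, edges

tree = parse_sexpr("(S (NP the garage) (VP is (PP next to (NP the house))) .)")
nodes, edges = flatten(tree)
for node_id, form in nodes:
    print(f'<node id="{node_id}" form="{form}"/>')
for src, dst in edges:
    print(f'<edge from="{src}" to="{dst}"/>')
```

With the final period counted as a word, the one-line s-expression turns
into 13 node lines plus 12 edge lines.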

TigerXML is more or less the same thing, except that the word nodes and
phrase nodes are called "terminals" and "non-terminals", and each edge is
listed inside the non-terminal it starts from.

To me, the s-expr is vastly superior in both size and readability to that
offered by eGXL and TigerXML.
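
For comparison, the "obvious" nested XML encoding mentioned earlier really
is a mechanical rewrite of the s-expression. A minimal Python sketch of the
round trip, assuming every phrase label is a legal XML element name (true
for S, NP, VP, PP, but not for every treebank tag); the function names are
my own:

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def parse_sexpr(text):
    """Parse an s-expression into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1
        return node

    return read()

def to_sexpr(tree):
    """Inverse of parse_sexpr."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(to_sexpr(child) for child in tree) + ")"

def to_xml(tree):
    """Nested lists -> nested XML, with the label as the element name."""
    if isinstance(tree, str):
        return escape(tree)
    label, children = tree[0], tree[1:]
    parts = []
    for i, child in enumerate(children):
        # Put a space only between two adjacent words.
        if i > 0 and isinstance(child, str) and isinstance(children[i - 1], str):
            parts.append(" ")
        parts.append(to_xml(child))
    return f"<{label}>{''.join(parts)}</{label}>"

def from_xml(elem):
    """ElementTree element -> nested lists (words split on whitespace)."""
    tree = [elem.tag]
    tree.extend((elem.text or "").split())
    for child in elem:
        tree.append(from_xml(child))
        tree.extend((child.tail or "").split())
    return tree

SENTENCE = "(S (NP the garage) (VP is (PP next to (NP the house))) .)"
tree = parse_sexpr(SENTENCE)
xml = to_xml(tree)
print(xml)
print(to_sexpr(from_xml(ET.fromstring(xml))))
```

Both directions fit in a few lines each, which is the sense in which the
two notations are equivalent.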

Enough of that.
================
Triples. As I mentioned, triples are really hot right now on the
"semantic web", and not at all procrustean. They've been heavily
standardized, see for example:  http://www.w3.org/TR/rdf-mt/  and
are widely used for blog syndication, and are being explored for
building ontologies (OWL, the "Web Ontology Language", a
follow-on to DAML+OIL, etc.) and knowledge-bases, queryable
by using SPARQL and a host of other acronyms that'll make your
head spin.  Triples promise to be the foundation of web-3.0.
So -- for example:

   (Berlin capital-of Germany)

is a triple one might soon expect to get out of Wikipedia.  Here's
a listing of just some triple databases so far:
http://protegewiki.stanford.edu/index.php/Protege_Ontology_Library
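
To make "queryable" concrete: a pile of triples can be searched by pattern,
with blanks acting as wildcards. This is only a toy Python sketch of the
idea, not SPARQL or any real triple store; the second and third triples are
added here for illustration.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every non-None slot of the pattern."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]

triples = [
    ("Berlin", "capital-of", "Germany"),
    ("Paris", "capital-of", "France"),      # illustrative addition
    ("Germany", "part-of", "Europe"),       # illustrative addition
]

# "What is the capital of Germany?"
print(match(triples, predicate="capital-of", obj="Germany"))

# "What do we know about Germany as a subject?"
print(match(triples, subject="Germany"))
```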

In my case, my triples are:
   _pobj(next_to, house)
   _psubj(next_to, garage)

Err, maybe this is a really bad example, since it shows off an idiosyncratic
treatment of prepositions ... but anyway, these are two triples.
How to represent these as XML?  Both eGXL and TigerXML again
convert these into opaque spaghetti.

I guess RDF is one candidate worth exploring in greater detail,
in particular, the N-Triples format:

  http://www.w3.org/TR/rdf-testcases/#ntriples

which is -- lo and behold -- just a plain, very readable listing of
subject, predicate, object, one triple per line, which might look
like, e.g.

next_to  <http://opencog.org/relations/1.0/pobj>  house .
next_to  <http://opencog.org/relations/1.0/psubj>  garage .

(Strictly, N-Triples wants the words spelled out too, as full URIs or
blank nodes; they are left bare here for readability.)  There's even a
trick, in the Turtle and Notation 3 notations mentioned below rather than
in plain N-Triples, to abbreviate "http://opencog.org/relations/1.0/psubj"
by defining a prefix for it:

@prefix rel: <http://opencog.org/relations/1.0/> .

and so: next_to rel:pobj house .    etc. would be nice and compact.

Anyway, it's more or less trivial to convert

   _pobj(next_to, house)
   _psubj(next_to, garage)

into N-Triples, and thence to RDF, or back in the other direction.
Oh, there's also the "Notation 3" format for RDF triples, see

  http://en.wikipedia.org/wiki/Notation_3

and also "Turtle" (Terse RDF Triple Language)

  http://en.wikipedia.org/wiki/Turtle_%28syntax%29
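
As a rough illustration of how mechanical that conversion is, here is a
Python sketch that turns relations like _pobj(next_to, house) into
N-Triples lines and back. Some assumptions of mine: the first argument is
taken as the subject and the second as the object (N-Triples lines run
subject, predicate, object, then a period), and the words get a made-up
placeholder namespace, http://example.org/word/, since N-Triples does not
allow bare words.

```python
import re

REL_BASE = "http://opencog.org/relations/1.0/"   # relation namespace used in this post
WORD_BASE = "http://example.org/word/"           # placeholder namespace (assumption)

RELATION = re.compile(r"_(\w+)\((\w+),\s*(\w+)\)")
NTRIPLE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")

def to_ntriple(relation):
    """'_pobj(next_to, house)' -> one N-Triples line."""
    m = RELATION.fullmatch(relation.strip())
    if m is None:
        raise ValueError(f"not a relation: {relation!r}")
    rel, first, second = m.groups()
    return f"<{WORD_BASE}{first}> <{REL_BASE}{rel}> <{WORD_BASE}{second}> ."

def from_ntriple(line):
    """Inverse of to_ntriple, for lines that use the two namespaces above."""
    m = NTRIPLE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not a URI-only triple: {line!r}")
    subj, pred, obj = m.groups()
    if not (subj.startswith(WORD_BASE) and pred.startswith(REL_BASE)
            and obj.startswith(WORD_BASE)):
        raise ValueError(f"unexpected namespace in: {line!r}")
    return f"_{pred[len(REL_BASE):]}({subj[len(WORD_BASE):]}, {obj[len(WORD_BASE):]})"

for relation in ["_pobj(next_to, house)", "_psubj(next_to, garage)"]:
    line = to_ntriple(relation)
    print(line)
    print(from_ntriple(line))
```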

Caution, I am not an RDF expert -- this is the limit of what I know.

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


