[Corpora-List] Parsed corpus file format

Linas Vepstas linasvepstas at gmail.com
Wed Jul 2 23:18:37 UTC 2008


Hi Olga,

2008/7/2 Olga Pustylnikov <olga.pustylnikov at uni-bielefeld.de>:
>
> eGXL:
> http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/index.php5/Main_Page
> might be a possible alternative format.

Thanks for the note. I looked at both this and TigerXML.
Let me reply quickly, with some first impressions:

-- TigerXML has a very nicely laid-out "head" section, with a
section for document meta-data, and a section describing
the types of relations to follow. I like this.

-- The downside of requiring relations to be declared up-front is
that it causes problems for me: my prepositional relations
are obtained on-the-fly from the text, so I have no a-priori knowledge
of what they are.  Thus, requiring them to be in the
<head> section, before the parsed <body> section, creates a
great burden -- either a two-pass parse, or some way of buffering
the parse info until the parse is done. This is a serious shortcoming.

-- Almost all linguistic data comes in one of two forms: trees and
triples. (By triple, I mean e.g. "subj(throw, John)" for the subject of
"John threw a rock".)  The people behind the "semantic web"
have discovered the general idea of triples (not just for dependency
grammars or linguistics, but also for ontologies, etc.; most generally,
"RDF", Resource Description Framework, triples), and
there are a huge number of projects now that deal explicitly
with RDF triples as a basic building block.
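
Returning to the up-front declaration problem from the second point: the
buffering workaround could look roughly like the sketch below. It is only an
illustration under stated assumptions, not the TigerXML or eGXL layout: the
"# relations:" header line, the function name write_corpus, and the relation
name "prep_at" are all hypothetical. The body is held in memory while relation
names are collected, and the declarations are written first once parsing is
done.

```python
import io

def write_corpus(triples, out):
    """Buffer the body while collecting relation names, then emit the
    declarations first. The output layout is hypothetical."""
    body = io.StringIO()
    relations = []   # relation names, in first-seen order
    seen = set()
    for rel, head, dep in triples:
        if rel not in seen:
            seen.add(rel)
            relations.append(rel)
        body.write(f"{rel}({head}, {dep})\n")
    # Only now, after the whole parse has been consumed, is the head known.
    out.write("# relations: " + " ".join(relations) + "\n")
    out.write(body.getvalue())

# Relations such as "prep_at" only become known while parsing the text.
parsed = [
    ("subj", "throw", "John"),
    ("obj", "throw", "rock"),
    ("prep_at", "throw", "window"),
]
out = io.StringIO()
write_corpus(iter(parsed), out)
result = out.getvalue()
```

The cost is visible in the sketch: nothing can be written until the whole
input has been consumed, so memory use grows with the size of the parse.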

Unfortunately, both TigerXML and eGXL represent the triples in a
non-human-readable format, and, worse, the way these triples are
represented is rather verbose.  I would really like to see them
be much more compact and human-readable.
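
As an illustration of how little machinery a compact notation needs, here is
a minimal sketch that reads and writes triples in the "subj(throw, John)"
style used above. The names Triple, parse_triple and format_triple are
hypothetical, and the sketch assumes that relation names and words contain no
whitespace, commas or parentheses.

```python
import re
from typing import NamedTuple

class Triple(NamedTuple):
    relation: str
    head: str
    dependent: str

# relation(head, dependent), with optional whitespace around each part.
_TRIPLE_RE = re.compile(
    r"^\s*([^\s(),]+)\s*\(\s*([^\s(),]+)\s*,\s*([^\s(),]+)\s*\)\s*$"
)

def parse_triple(line):
    """Parse one 'relation(head, dependent)' line into a Triple."""
    m = _TRIPLE_RE.match(line)
    if m is None:
        raise ValueError(f"not a triple: {line!r}")
    return Triple(*m.groups())

def format_triple(t):
    """Write a Triple back out in the same one-line notation."""
    return f"{t.relation}({t.head}, {t.dependent})"

t = parse_triple("subj(throw, John)")
```

Each triple stays on one line and survives a round trip through
parse_triple and format_triple unchanged.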

The other problem is with trees: the "natural" way of representing
trees is with S-expressions; unfortunately, XML makes this hard.
Trees can also be represented as collections of triples, something
that TigerXML tries to do, although it has to insert bogus ID nodes
for any non-terminal tree nodes that would otherwise be unlabelled.
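
The trade-off between the two tree representations can be sketched as
follows. This is a generic illustration, not TigerXML's actual encoding: the
bracketing of the example sentence, the relation names "label" and "child",
and the IDs n1, n2, ... are assumptions made for the example. A small reader
turns an S-expression into nested lists, and a second function flattens that
tree into triples, inventing an ID for every non-terminal along the way.

```python
def parse_sexpr(text):
    """Read one S-expression into nested lists; terminals stay strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # skip the closing ")"
        return node

    return read()

def tree_to_triples(tree):
    """Flatten a nested-list tree into (relation, node_id, value) triples."""
    triples = []
    counter = 0

    def visit(node):
        nonlocal counter
        if isinstance(node, str):
            return node           # terminal: the word itself
        counter += 1
        node_id = f"n{counter}"   # synthetic ID for a non-terminal
        label, *children = node
        triples.append(("label", node_id, label))
        for child in children:
            triples.append(("child", node_id, visit(child)))
        return node_id

    visit(tree)
    return triples

tree = parse_sexpr("(S (NP John) (VP threw (NP a rock)))")
triples = tree_to_triples(tree)
```

The one-line S-expression becomes eleven triples, and the two NP nodes can
only be told apart through their invented IDs.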

I would really, really like to see some simple, S-expression-like
format for trees ... something as readable as S-expressions, but
fitting into the XML mind-set.

(To throw a rock at my own format -- it's a quick hack job, and
has many faults, which I won't expand on here.)

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


