[Corpora-List] Parsed corpus file format

Thu Jul 3 15:31:51 UTC 2008

Hi,

2008/7/3 Emiliano Guevara <emiliano.guevara at unibo.it>:
>
> - how large do you expect it to be?

Not sure. The project lead is David Hart, and I was hoping he'd
chip in here.  The nominal answer is "until the sysadmins get
bored", or something like that.  It depends partly on how useful
the stuff is perceived to be.

> - what taggers, parsers, etc. are you planning to use?

Currently, only link-grammar + relex.  I believe the proposed file
format can accommodate arbitrary systems.  Note also: part of the
long-term plan is to periodically reparse texts as the parse tools
improve.

> - what corpus management system are you planning to use? you mentioned
> Apache Lucene, but Lucene is not able to deal with richly annotated corpora.

We've been talking to the Lucene developers, and they've failed to
mention this. I don't know why.

> - I guess you are talking about English only, are you aware of the Wacky
> initiative?

I just found out about it yesterday.

> People involved in the Wacky initiative have been doing more or less what
> you are telling the list for at least a couple of years,

Perhaps David Hart can open up a conversation here ... !?

> except for the
> parsing (but the produced corpora can be parsed anytime...).

From direct experience, parsing is a limiting factor: it takes immense
amounts of cpu time. So my interest is partly that of greed: I'd like to
have access to large quantities of parsed text, and I can't have this by
running ad-hoc parses on my assortment of home computers and
borrowed time on mainframes.

> http://wacky.sslmit.unibo.it/
>
> That's the part about the state of the art...
>
> Second, answering your question about the file format:
> the link you provided looks very much like the spurious XML you have to use
> with the CQP/CWB, and I'm very used to seeing my corpora in that format.
> However, if your goal is to have the best possible distribution format, then
> why not stick to a recognised standard (XML, of course) and avoid
> reinventing the wheel? The TEI XML guideline is probably the de facto
> standard:
>
> http://www.tei-c.org/

Do you have any experience with this?  I can't figure out how to use it.
Yes, I could add a teiHeader; not a bad idea.  And I found WordHoard
by following the links. It was very suggestive, and shares a number of
good ideas with TigerXML, which I'm thinking of incorporating.  But I
didn't see a more direct way of making use of TEI. Is there one?

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora