[Corpora-List] Parsed corpus file format
Emiliano Guevara
emiliano.guevara at unibo.it
Thu Jul 3 06:37:19 UTC 2008
Hi Linas,
just out of curiosity:
- how large do you expect it to be?
- what taggers, parsers, etc. are you planning to use?
- what corpus management system are you planning to use? You mentioned
Apache Lucene, but Lucene is not able to deal with richly annotated
corpora.
- I guess you are talking about English only; are you aware of the
WaCky initiative?
People involved in the WaCky initiative have been doing more or less
what you are describing to the list for at least a couple of years, except
for the parsing (but the corpora produced can be parsed anytime...).
http://wacky.sslmit.unibo.it/
That's the part about the state of the art...
Second, answering your question about the file format:
the link you provided looks very much like the spurious XML you have
to use with the CQP/CWB, and I'm very used to seeing my corpora in
that format.
However, if your goal is to have the best possible distribution
format, then why not stick to a recognised standard (XML, of
course) and avoid reinventing the wheel? The TEI XML guidelines are
probably the de facto standard:
http://www.tei-c.org/
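For anyone who has not seen the CQP/CWB-style input mentioned above: it is
a "verticalized" text, one token per line with tab-separated annotations,
interleaved with XML-like tags for structure. Below is a minimal sketch of
a reader for such a file. The three-column layout (word, POS, lemma), the
tag names and the sample data are assumptions made for illustration only,
not the layout of any particular corpus.

```python
from typing import List, Tuple

Token = Tuple[str, ...]

# Hypothetical sample: word, POS tag and lemma separated by tabs.
SAMPLE = "\n".join([
    '<text id="demo">',
    "<s>",
    "The\tDT\tthe",
    "dogs\tNNS\tdog",
    "bark\tVBP\tbark",
    ".\tSENT\t.",
    "</s>",
    "<s>",
    "Cats\tNNS\tcat",
    "sleep\tVBP\tsleep",
    ".\tSENT\t.",
    "</s>",
    "</text>",
])


def parse_vertical(text: str) -> List[List[Token]]:
    """Group token lines into sentences delimited by <s> ... </s>."""
    sentences: List[List[Token]] = []
    current = None  # tokens of the sentence being read, if any
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<s>") or line.startswith("<s "):
            current = []
        elif line.startswith("</s>"):
            if current is not None:
                sentences.append(current)
            current = None
        elif line.startswith("<") and line.endswith(">"):
            # Other structural tags (e.g. <text>) are ignored here.
            # Simplification: a token that itself looks like a tag
            # would be skipped as well.
            continue
        elif current is not None:
            current.append(tuple(line.split("\t")))
    return sentences


if __name__ == "__main__":
    for sentence in parse_vertical(SAMPLE):
        print(" ".join(token[0] for token in sentence))
```

Note that the tags are matched line by line rather than parsed as XML,
which is exactly why a file like this is not a well-formed XML document
and why a real XML standard may be the better choice for distribution.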
best wishes,
E.
On Jul 2, 2008, at 22:17 PM, Linas Vepstas wrote:
> As a part of this year's Google Summer of Code, we have a Boston University
> student preparing a web crawler, whose goal is to crawl some part of the web
> (as large as we can manage) and prepare a parsed version of the text found.
****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dipart. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia
http://morbo.lingue.unibo.it/
emiliano.guevara at unibo.it
emiguevara at gmail.com
****************************************