[Corpora-List] Parsed corpus file format

Thu Jul 3 06:37:19 UTC 2008

Hi Linas,

just out of curiosity:

- how large do you expect it to be?

- what taggers, parsers, etc. are you planning to use?

- what corpus management system are you planning to use? You mentioned
Apache Lucene, but Lucene is not able to deal with richly annotated
corpora.

- I guess you are talking about English only. Are you aware of the
WaCky initiative?

People involved in the WaCky initiative have been doing more or less
what you are describing to the list for at least a couple of years, except
for the parsing (but the resulting corpora can be parsed at any time...).

http://wacky.sslmit.unibo.it/

That's the part about the state of the art...

Second, answering your question about the file format:
the link you provided looks very much like the pseudo-XML you have
to use with CQP/CWB, and I'm very used to seeing my corpora in
that format.
However, if your goal is to have the best possible distribution
format, then why not stick to a recognised standard (XML, of
course) and avoid reinventing the wheel? The TEI XML guidelines are
probably the de facto standard:

http://www.tei-c.org/
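
To make the contrast concrete, here is a minimal sketch of turning a
CWB-style vertical file (one token per line, tab-separated columns,
sentence tags on their own lines) into well-formed, TEI-like XML. The
column order (word, part of speech, lemma), the function name
vertical_to_tei, and the use of pos/lemma attributes on <w> are
assumptions made for illustration only; they are not taken from the
format under discussion, and the output is not validated against any
TEI schema.

```python
import xml.etree.ElementTree as ET


def vertical_to_tei(vertical: str) -> ET.Element:
    """Convert CWB-style vertical text into a TEI-like element tree.

    Hypothetical helper: assumes each token line has exactly three
    tab-separated columns (word form, part of speech, lemma) and that
    sentences are delimited by <s> and </s> on their own lines.
    """
    body = ET.Element("body")
    sentence = None
    for line in vertical.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "<s>":
            # Open a new sentence element.
            sentence = ET.SubElement(body, "s")
        elif line == "</s>":
            # Close the current sentence.
            sentence = None
        else:
            # Assumed column order: word form, part of speech, lemma.
            word, pos, lemma = line.split("\t")
            parent = sentence if sentence is not None else body
            token = ET.SubElement(parent, "w", {"pos": pos, "lemma": lemma})
            token.text = word
    return body


if __name__ == "__main__":
    sample = "<s>\nThe\tDT\tthe\ndogs\tNNS\tdog\nbark\tVBP\tbark\n</s>"
    print(ET.tostring(vertical_to_tei(sample), encoding="unicode"))
```

The point of the sketch is only that the two representations carry the
same information, so distributing in a standard XML format does not
prevent anyone from regenerating the vertical form for CQP/CWB.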

best wishes,

E.

On Jul 2, 2008, at 22:17 PM, Linas Vepstas wrote:

> As a part of this year's Google Summer of Code, we have a Boston University
> student preparing a web crawler, whose goal is to crawl some part of the web
> (as large as we can manage) and prepare a parsed version of the text found.

****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dipart. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia
   http://morbo.lingue.unibo.it/
   emiliano.guevara at unibo.it
   emiguevara at gmail.com
****************************************

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora