[Corpora-List] Parsed corpus file format

Wed Jul 2 20:49:27 UTC 2008

I'm not sure the crawling part of the project has much value in itself. Setting up a crawl and designing all the necessary post-processing stages takes a lot of time, and this has already been done many times; for an example, see the ukWaC paper at the last Web as Corpus workshop:
http://webascorpus.sf.net/WAC4

I think it might be more sensible to start with an existing collection such as ukWaC or SPIRIT and enhance it by adding as much annotation as you can manage.
Serge

-----Original Message-----
From: corpora-bounces at uib.no on behalf of Linas Vepstas
Sent: Wed 02/07/2008 21:17
To: corpora at uib.no
Cc: David Hart; Rich Jones
Subject: [Corpora-List] Parsed corpus file format

Hi,

I'd like to announce a project, and solicit comments on the
project proceedings.

As a part of this year's Google Summer of Code, we have a Boston University
student preparing a web crawler, whose goal is to crawl some part of the web
(as large as we can manage) and prepare a parsed version of the text found.
The project is being done in collaboration with SIAI, the Apache project, Wikia
search and Novamente (my employer), with Wikia providing the servers and
disk space.

The results of the parsing will be publicly available, for all to use.
I believe that this should result in a very large collection of pre-parsed
corpora, which should allow new types of statistical analysis to be
performed -- and thus, I hope, be of interest to this list.  Four types of
parse information will be available:
-- word features (part-of-speech, lemmas, tense, noun-number, etc.)
-- tree-bank-style constituent tree
-- dependency-grammar-style dependency relations
-- list of link-grammar relations
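
As a rough illustration of what one sentence carrying all four layers might
look like in memory, here is a minimal Python sketch. It is NOT the RelEx
compact output format described at the wiki page; the class names, field
names, the helper dependents_of, and the example labels are all hypothetical,
chosen only to show how the layers could share one set of token indices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Token:
    """One word plus its word-level features (hypothetical field names)."""
    index: int
    form: str
    lemma: str
    pos: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class ParsedSentence:
    """All four annotation layers for a single sentence."""
    text: str
    tokens: List[Token]
    # Treebank-style constituent tree, kept as a bracketed string.
    constituents: str
    # Dependency relations as (relation, head index, dependent index).
    dependencies: List[Tuple[str, int, int]]
    # Link-grammar links as (link label, left index, right index).
    links: List[Tuple[str, int, int]]


def dependents_of(sentence: ParsedSentence, head_index: int) -> List[Tuple[str, str]]:
    """Return (relation, word form) for every dependent of the given head."""
    return [
        (rel, sentence.tokens[dep].form)
        for rel, head, dep in sentence.dependencies
        if head == head_index
    ]


# Illustrative values only; the labels are simplified, not real parser output.
example = ParsedSentence(
    text="The cat sleeps.",
    tokens=[
        Token(0, "The", "the", "DT"),
        Token(1, "cat", "cat", "NN", {"noun_number": "singular"}),
        Token(2, "sleeps", "sleep", "VBZ", {"tense": "present"}),
    ],
    constituents="(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))",
    dependencies=[("det", 1, 0), ("nsubj", 2, 1)],
    links=[("D", 0, 1), ("S", 1, 2)],
)

print(dependents_of(example, 2))
```

Keeping every layer keyed to the same token indices is one possible design
choice; it lets statistics be gathered across layers without re-aligning the
text.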

I was unable to find a file format that could handle all of this, and was forced
to invent my own. I would really like to get criticism/feedback/suggestions on
the file format. The proposed file format is described here:

    http://opencog.org/wiki/RelEx_compact_output

I believe that the proposed file format is "generic", and could be used
with any parsing technology, and not just the parser that we are planning
to use -- thus, I think it's of general interest.

The technology that we are planning to use is the CMU link-grammar
parser, with a layer on top, RelEx, which extracts dependency-grammar-like
relations from it.  RelEx is described at

   http://opencog.org/wiki/RelEx

Feedback, comments appreciated.

-- Dr. Linas Vepstas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
