[Corpora-List] Parsed corpus file format

Linas Vepstas linasvepstas at gmail.com
Wed Jul 2 20:17:18 UTC 2008


Hi,

I'd like to announce a project, and to solicit comments on it.

As a part of this year's Google Summer of Code, we have a Boston University
student preparing a web crawler, whose goal is to crawl some part of the web
(as large as we can manage) and prepare a parsed version of the text found.
The project is being done in collaboration with SIAI, the Apache project, Wikia
search and Novamente (my employer), with Wikia providing the servers and
disk space.

The results of the parsing will be publicly available, for all to use.
I believe that this should result in a very large collection of pre-parsed
corpora, which should allow new types of statistical analysis to be
performed -- and thus, I hope, should be of interest to this list.  Four
types of parse information will be available:
-- word features (part-of-speech, lemmas, tense, noun-number, etc.)
-- tree-bank-style constituent tree
-- dependency-grammar-style dependency relations
-- list of link-grammar relations

I was unable to find a file format that could handle all of this, and was forced
to invent my own. I would really like to get criticism/feedback/suggestions on
the file format. The proposed file format is described here:

    http://opencog.org/wiki/RelEx_compact_output
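
To make the four kinds of parse information listed above concrete, here is
a rough Python sketch of a single sentence carrying all four layers at once.
It is NOT the proposed file format (that is described at the link above);
the class names, field names and the relation/link labels are hypothetical
and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Word:
    """Per-word features: part of speech, lemma, tense, noun-number, etc."""
    index: int                      # 1-based position in the sentence
    form: str                       # surface form as it appears in the text
    lemma: str
    pos: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dependency:
    """A dependency-grammar-style relation between two word indices."""
    label: str
    head: int
    dependent: int


@dataclass
class Link:
    """A link-grammar-style link between a left and a right word index."""
    label: str
    left: int
    right: int


@dataclass
class ParsedSentence:
    """One sentence with all four layers of parse information."""
    text: str
    words: List[Word]               # layer 1: word features
    constituents: str               # layer 2: bracketed constituent tree
    dependencies: List[Dependency]  # layer 3: dependency relations
    links: List[Link]               # layer 4: link-grammar relations


def example_sentence() -> ParsedSentence:
    # Hypothetical hand-written analysis, not real parser output.
    words = [
        Word(1, "Dogs", "dog", "noun", {"noun_number": "plural"}),
        Word(2, "bark", "bark", "verb", {"tense": "present"}),
    ]
    return ParsedSentence(
        text="Dogs bark.",
        words=words,
        constituents="(S (NP Dogs) (VP bark))",
        dependencies=[Dependency("_subj", head=2, dependent=1)],
        links=[Link("Sp", left=1, right=2)],
    )
```

The point of the sketch is only that the layers index into the same list of
words, so a single record per sentence can hold all of them side by side.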

I believe that the proposed file format is "generic", and could be used
with any parsing technology, and not just the parser that we are planning
to use -- thus, I think it's of general interest.

The technology that we are planning to use is the CMU link-grammar
parser, with a layer on top, RelEx, which extracts dependency-grammar-like
relations from it.  RelEx is described at

   http://opencog.org/wiki/RelEx

Feedback, comments appreciated.

-- Dr. Linas Vepstas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


