[Corpora-List] Parsed corpus file format

Wed Jul 2 20:42:40 UTC 2008

Hi Dr. Vepstas,

have you considered using TIGER XML?
http://www.ims.uni-stuttgart.de/projekte/TIGER/

It allows for annotation at the word level, constituent structure, and
dependency structure, and it can even handle discontinuous
constituents.
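
To make the discontinuity point concrete, here is a minimal Python sketch. The XML fragment is hand-written in the style of TIGER-XML: the sentence, ids, and labels are my own illustrative assumptions, and the element and attribute names (t, nt, edge, idref) follow my recollection of the format, so please check them against the actual schema. Constituents point to their children by id rather than by nesting, so a VP can dominate tokens that are not adjacent.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of TIGER-XML. The sentence, ids, and
# labels are illustrative assumptions, not taken from the TIGER corpus.
TIGER_SNIPPET = """
<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="What" pos="WP"/>
      <t id="s1_2" word="did" pos="VBD"/>
      <t id="s1_3" word="she" pos="PRP"/>
      <t id="s1_4" word="see" pos="VB"/>
    </terminals>
    <nonterminals>
      <nt id="s1_501" cat="VP">
        <edge label="OA" idref="s1_1"/>
        <edge label="HD" idref="s1_4"/>
      </nt>
      <nt id="s1_500" cat="S">
        <edge label="HD" idref="s1_2"/>
        <edge label="SB" idref="s1_3"/>
        <edge label="OC" idref="s1_501"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def terminal_positions(sentence):
    """Map each terminal id to its 0-based position in the sentence."""
    return {t.get("id"): i for i, t in enumerate(sentence.iter("t"))}


def yield_of(sentence, node_id, positions):
    """Return the sorted token positions dominated by the given node."""
    if node_id in positions:  # the node is a terminal
        return [positions[node_id]]
    nt = next(n for n in sentence.iter("nt") if n.get("id") == node_id)
    covered = []
    for edge in nt.findall("edge"):
        covered.extend(yield_of(sentence, edge.get("idref"), positions))
    return sorted(covered)


def is_discontinuous(span):
    """A constituent is discontinuous if its token positions have a gap."""
    return span != list(range(span[0], span[-1] + 1))


sentence = ET.fromstring(TIGER_SNIPPET)
positions = terminal_positions(sentence)
vp_span = yield_of(sentence, "s1_501", positions)
s_span = yield_of(sentence, "s1_500", positions)
print("VP covers", vp_span, "discontinuous:", is_discontinuous(vp_span))
print("S covers", s_span, "discontinuous:", is_discontinuous(s_span))
```

In this toy sentence the VP dominates "What" and "see" but not the two words in between, which is the kind of structure a strictly nested bracket notation cannot express directly.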

Cheers,
Katrin

On Wed, Jul 2, 2008 at 3:17 PM, Linas Vepstas <linasvepstas at gmail.com> wrote:
> Hi,
>
> I'd like to announce a project and solicit comments on the
> project proceedings.
>
> As a part of this year's Google Summer of Code, we have a Boston University
> student preparing a web crawler, whose goal is to crawl some part of the web
> (as large as we can manage) and prepare a parsed version of the text found.
> The project is being done in collaboration with SIAI, the Apache project, Wikia
> search and Novamente (my employer), with Wikia providing the servers and
> disk space.
>
> The results of the parsing will be publicly available, for all to use.
> I believe that this should result in a very large collection of pre-parsed
> corpora, which should allow new types of statistical analysis to be
> performed -- and thus, I hope, will be of interest to this list. Four types
> of parse information will be available:
> -- word features (part-of-speech, lemmas, tense, noun-number, etc.)
> -- tree-bank-style constituent tree
> -- dependency-grammar-style dependency relations
> -- list of link-grammar relations
>
> I was unable to find a file format that could handle all of this, and was forced
> to invent my own. I would really like to get criticism/feedback/suggestions on
> the file format. The proposed file format is described here:
>
>    http://opencog.org/wiki/RelEx_compact_output
>
> I believe that the proposed file format is "generic", and could be used
> with any parsing technology, and not just the parser that we are planning
> to use -- thus, I think it's of general interest.
>
> The technology that we are planning to use is the CMU link-grammar
> parser, with a layer on top, RelEx, which extracts dependency-grammar-like
> relations from it. RelEx is described at
>
>   http://opencog.org/wiki/RelEx
>
> Feedback, comments appreciated.
>
> -- Dr. Linas Vepstas
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Katrin Erk, Department of Linguistics
The University of Texas at Austin
http://comp.ling.utexas.edu/erk
