[Corpora-List] Parsed corpus file format

Olga Pustylnikov olga.pustylnikov at uni-bielefeld.de
Wed Jul 2 21:40:40 UTC 2008


Dear Dr. Vepstas,

eGXL:
http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/index.php5/Main_Page
might be a possible alternative format. It is generic and can represent
any kind of graph structure, such as constituent, dependency, and
discontinuous formalisms, as well as combinations of them. The storage costs
are also small; details on the evaluation of the format are described in:

@inproceedings{Pustylnikov:Mehler:Gleim:2008,
  author={Olga Pustylnikov and Alexander Mehler and Rüdiger Gleim},
  title={A Unified Database of Dependency Treebanks. {Integrating}, Quantifying
    \& Evaluating Dependency Data},
  booktitle={Proceedings of the 6th Language Resources and Evaluation Conference
    (LREC 2008), Marrakech (Morocco)},
  year={2008}
}
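
To illustrate what "generic graph structure" means here, below is a minimal
Python sketch of how constituent and dependency information can share one
graph over the same tokens. This is only an illustration under my own
assumptions: the class name ParseGraph, the layer names, the example sentence
and its labels are all hypothetical, and this is not the actual eGXL schema
(see the wiki above for that).

```python
from dataclasses import dataclass, field


@dataclass
class ParseGraph:
    """Nodes plus typed, labelled edges. Hypothetical; not the eGXL schema."""
    nodes: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: list = field(default_factory=list)  # (layer, source, target, label)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, layer, source, target, label=""):
        # Refuse dangling edges so every layer stays a well-formed graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((layer, source, target, label))

    def layer(self, name):
        """Return only the edges belonging to one annotation layer."""
        return [e for e in self.edges if e[0] == name]


def build_example():
    """Encode 'dogs bark' with a constituent layer and a dependency layer."""
    g = ParseGraph()
    # Token nodes carry word-level features.
    for i, (form, pos) in enumerate([("dogs", "NNS"), ("bark", "VBP")], start=1):
        g.add_node(f"t{i}", kind="token", form=form, pos=pos)
    # Phrase nodes exist only for the constituent layer.
    for node_id, cat in [("s", "S"), ("np", "NP"), ("vp", "VP")]:
        g.add_node(node_id, kind="phrase", cat=cat)
    g.add_edge("constituent", "s", "np")
    g.add_edge("constituent", "s", "vp")
    g.add_edge("constituent", "np", "t1")
    g.add_edge("constituent", "vp", "t2")
    # The dependency layer links tokens directly: head -> dependent.
    g.add_edge("dependency", "t2", "t1", "nsubj")
    return g
```

Because each edge records its layer, a discontinuous constituent or a further
relation type is just another set of edges over the same nodes; no second
file or second tree is needed.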


Best,

On Wed, Jul 2, 2008 at 10:17 PM, Linas Vepstas <linasvepstas at gmail.com>
wrote:

> Hi,
>
> I'd like to announce a project, and solicit comments on the
> project proceedings.
>
> As a part of this year's Google Summer of Code, we have a Boston University
> student preparing a web crawler, whose goal is to crawl some part of the web
> (as large as we can manage) and prepare a parsed version of the text found.
> The project is being done in collaboration with SIAI, the Apache project,
> Wikia search and Novamente (my employer), with Wikia providing the servers
> and disk space.
>
> The results of the parsing will be publicly available, for all to use.
> I believe that this should result in a very large collection of pre-parsed
> corpora, which should allow new types of statistical analysis to be
> performed -- and thus, I hope, be of interest to this list. Four types of
> parse information will be available:
> -- word features (part-of-speech, lemmas, tense, noun-number, etc.)
> -- tree-bank-style constituent tree
> -- dependency-grammar-style dependency relations
> -- list of link-grammar relations
>
> I was unable to find a file format that could handle all of this, and was
> forced to invent my own. I would really like to get
> criticism/feedback/suggestions on the file format. The proposed file format
> is described here:
>
>    http://opencog.org/wiki/RelEx_compact_output
>
> I believe that the proposed file format is "generic", and could be used
> with any parsing technology, and not just the parser that we are planning
> to use -- thus, I think it's of general interest.
>
> The technology that we are planning to use is the CMU link-grammar
> parser, with a layer on top, RelEx, which extracts dependency-grammar-like
> relations from it. RelEx is described at
>
>   http://opencog.org/wiki/RelEx
>
> Feedback, comments appreciated.
>
> -- Dr. Linas Vepstas
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
Olga Pustylnikov

Universität Bielefeld
Fakultät für Linguistik und Literaturwissenschaft
Universitätsstraße 25
D-33615 Bielefeld

http://ariadne.coli.uni-bielefeld.de/pustylnikov/
olga.pustylnikov at uni-bielefeld.de