Dear Dr. Vepstas,<br><br>eGXL: <a href="http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/index.php5/Main_Page" target="_blank">http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/index.php5/Main_Page</a> <br>might be a possible alternative format. It is generic, allowing any kind of graph structure to be represented, such as constituent, dependency, and discontinuous formalisms, as well as combinations of them. The storage costs are also small; details on the evaluation of the format are described in:<br>
<pre>@inproceedings{Pustylnikov:Mehler:Gleim:2008,
 author={Olga Pustylnikov and Alexander Mehler and Rüdiger Gleim},
 title={A Unified Database of Dependency Treebanks. {Integrating}, Quantifying
 \& Evaluating Dependency Data},
 booktitle={Proceedings of the 6th Language Resources and Evaluation Conference
 (LREC 2008), Marrakech (Morocco)},
 year={2008}
}</pre>
<br>Best,<br><br><div class="gmail_quote">On Wed, Jul 2, 2008 at 10:17 PM, Linas Vepstas &lt;<a href="mailto:linasvepstas@gmail.com" target="_blank">linasvepstas@gmail.com</a>&gt; wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br>
<br>
I'd like to announce a project and solicit comments on the<br>
project proceedings.<br>
<br>
As a part of this year's Google Summer of Code, we have a Boston University<br>
student preparing a web crawler, whose goal is to crawl some part of the web<br>
(as large as we can manage) and prepare a parsed version of the text found.<br>
The project is being done in collaboration with SIAI, the Apache project, Wikia<br>
search and Novamente (my employer), with Wikia providing the servers and<br>
disk space.<br>
<br>
The results of the parsing will be publicly available, for all to use.<br>
I believe that<br>
this should result in a very large collection of pre-parsed corpora, which should<br>
allow new types of statistical analysis to be performed -- and thus, I hope, be<br>
of interest to this list. Four types of parse information will be available:<br>
-- word features (part-of-speech, lemmas, tense, noun-number, etc.)<br>
-- tree-bank-style constituent tree<br>
-- dependency-grammar-style dependency relations<br>
-- list of link-grammar relations<br>
<br>
I was unable to find a file format that could handle all of this, and was forced<br>
to invent my own. I would really like to get criticism/feedback/suggestions on<br>
the file format. The proposed file format is described here:<br>
<br>
<a href="http://opencog.org/wiki/RelEx_compact_output" target="_blank">http://opencog.org/wiki/RelEx_compact_output</a><br>
<br>
I believe that the proposed file format is "generic", and could be used<br>
with any parsing technology, and not just the parser that we are planning<br>
to use -- thus, I think it's of general interest.<br>
<br>
The technology that we are planning to use is the CMU link-grammar<br>
parser, with a layer on top, RelEx, which extracts dependency-grammar-like<br>
relations from it. RelEx is described at<br>
<br>
<a href="http://opencog.org/wiki/RelEx" target="_blank">http://opencog.org/wiki/RelEx</a><br>
<br>
Feedback, comments appreciated.<br>
<br>
-- Dr. Linas Vepstas<br>
<br>
_______________________________________________<br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>Olga Pustylnikov<br><br>Universität Bielefeld<br>Fakultät für Linguistik und Literaturwissenschaft<br>Universitätsstraße 25<br>D-33615 Bielefeld<br><br><a href="http://ariadne.coli.uni-bielefeld.de/pustylnikov/" target="_blank">http://ariadne.coli.uni-bielefeld.de/pustylnikov/</a><br>
<a href="mailto:olga.pustylnikov@uni-bielefeld.de" target="_blank">olga.pustylnikov@uni-bielefeld.de</a>