[Corpora-List] Announcement: Release of the DependencyTreebankDatabase DTDB 1.0
tufis at racai.ro
Wed Feb 6 16:59:57 UTC 2008
Dear Olga,
At http://corp.hum.sdu.dk Eckhard Bick has created a great grammatically
annotated Romanian corpus and some other similarly annotated corpora for
various languages.
The Romanian corpus covers the business language domain and has a size
of 21.4 million words (27 million tokens). It was compiled by Arina
Greavu (arinagreavu at yahoo.com) from news text sources, and annotated
with (a) PoS and morphology using our tagger, as well as (b) syntactic
function and shallow dependency markers using a Constraint Grammar
system at VISL (http://beta.visl.sdu.dk/constraint_grammar.html). You
can get further information from Eckhard (eckhard.bick at mail.dk).
Best regards,
Dan
--------------------------------------------------------------------------------
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf
Of Olga Pustylnikov
Sent: 5 February 2008 10:49
To: corpora at uib.no
Subject: Re: [Corpora-List] Announcement: Release of the
DependencyTreebankDatabase DTDB 1.0
Dear Sabine,
Thank you for your response and for your helpful hints. Our main goals
in converting to eGXL were:
1. to unify the treebanks by means of XML
2. to integrate the treebanks into a corpus management system where
not only treebanks but also spoken and web corpora are
stored and retrieved by means of GXL (an XML-based graph representation
format).
Restricting yourself to one specific format always entails additional
adaptations of your application when you have to deal with a new
treebank. Thus, we tried to select a format that is generic enough to
be reused and that is suitable for treebanks. GXL is a generic graph
model that can represent any kind of corpus, since any sort of
relation can be expressed in terms of a graph. That makes GXL a
useful means for corpus retrieval. Treebanks can easily be mapped to
it (since trees are special cases of graphs). eGXL slightly modifies
GXL to account for the specifics of treebanks. Thus, we selected
this format because it meets both requirements: it is generic and
suitable for treebanks.
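To illustrate the point that a dependency tree is just a special case of a directed graph, here is a minimal sketch in Python that writes one sentence as plain GXL. It uses only the basic GXL elements (gxl, graph, node, edge, attr, string). The eGXL extensions are not described in this message, so none of them are modelled here. The function name, the "t" id prefix, the attribute names "form" and "relation", and the sample sentence with its labels are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def dependency_tree_to_gxl(graph_id, tokens):
    """Build a plain GXL graph from one dependency-annotated sentence.

    tokens: list of (token_id, form, head_id, relation) tuples.
    A head_id of 0 marks the root, which gets no incoming edge.
    """
    gxl = ET.Element("gxl")
    graph = ET.SubElement(gxl, "graph", id=graph_id, edgemode="directed")

    # One GXL node per token, with the word form stored as an attribute.
    for token_id, form, _head_id, _relation in tokens:
        node = ET.SubElement(graph, "node", id=f"t{token_id}")
        attr = ET.SubElement(node, "attr", name="form")
        ET.SubElement(attr, "string").text = form

    # One directed GXL edge per dependency, pointing from head to dependent.
    for token_id, _form, head_id, relation in tokens:
        if head_id == 0:
            continue
        edge = ET.SubElement(
            graph, "edge", {"from": f"t{head_id}", "to": f"t{token_id}"}
        )
        attr = ET.SubElement(edge, "attr", name="relation")
        ET.SubElement(attr, "string").text = relation

    return gxl


# Invented example sentence; the relation labels are illustrative only.
sentence = [
    (1, "Olga", 2, "subj"),
    (2, "released", 0, "root"),
    (3, "DTDB", 2, "obj"),
]
gxl_root = dependency_tree_to_gxl("s1", sentence)
print(ET.tostring(gxl_root, encoding="unicode"))
```

Because the nodes and edges carry no tree-specific assumptions, the same structure could hold other relation types (for example alignment links in spoken corpora), which is the genericity argued for above.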
In my paper I don't provide a detailed comparison of eGXL with other
formats. However, CoNLL is referred to, although only indirectly, in
the comparison of the treebanks. Please send me a reference to your work,
which I failed to mention in this paper, and I will consider it in
my future work.
Best regards,
On Feb 3, 2008 1:39 PM, Sabine Buchholz
<sabine.buchholz at crl.toshiba.co.uk> wrote:
Dear Olga,
I think uniform formats for treebanks are a good idea and therefore read
your announcement, Wiki page and article with interest. However, that raised
a lot of questions:
You clearly are aware of the CoNLL-X shared task on multilingual dependency
parsing, as you link to its home page from your Wiki. For that task 13
treebanks were converted to a uniform format, many of them among the 11 you
list. Our goal was probably different from yours but
1) Why is that work not even mentioned in the paper, let alone compared to?
2) What part of the analyses you did for the paper could you not have done
using the CoNLL-X format?
You even seem to have used the CoNLL-X version of some treebanks (e.g.
Dutch) as the basis of your eGXL conversion (the Dutch example in your paper
is in CoNLL-X and not the original Alpino format).
3) Why did you choose to do that? The conversion from Alpino to CoNLL-X
format loses some information, so why not convert from the original format?
Same potentially for Swedish and Bulgarian.
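For readers unfamiliar with the CoNLL-X format discussed here: each token is one line of ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), sentences are separated by blank lines, and HEAD is 0 for the root. The following is a minimal reading sketch in Python, not the shared task's own tooling. The function name and the two-token sample with its tags are invented for illustration.

```python
CONLLX_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def parse_conllx(text):
    """Split CoNLL-X text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLX_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        token = dict(zip(CONLLX_COLUMNS, fields))
        # ID and HEAD are integers; HEAD 0 denotes the sentence root.
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Invented sample; tags and labels are illustrative, not from any treebank.
sample = (
    "1\tDogs\tdog\tN\tNNS\t_\t2\tSBJ\t_\t_\n"
    "2\tbark\tbark\tV\tVBP\t_\t0\tROOT\t_\t_\n"
    "\n"
)
sentences = parse_conllx(sample)
for token in sentences[0]:
    print(token["id"], token["form"], token["head"], token["deprel"])
```

Anything the original treebank encodes outside these ten columns has no slot in this representation, which is one way a conversion through CoNLL-X can lose information.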
With regard to your question about other treebanks to add to your database:
in addition to the remainder of the 13 CoNLL-X treebanks and the new ones
converted for the successor (the CoNLL 2007 shared task on dependency
parsing), http://en.wikipedia.org/wiki/Treebank lists even more treebanks.
But you probably already know that, you link to it from your Wiki...
Although I just noticed that the Romanian treebank you used is still missing
from that list...
Looking forward to hearing from you,
kind regards,
Sabine Buchholz
----- Original Message -----
From: Olga Pustylnikov
To: corpora at uib.no
Sent: Friday, February 01, 2008 9:31 AM
Subject: [Corpora-List] Announcement: Release of the Dependency
TreebankDatabase DTDB 1.0
Dear list members,
I'm happy to announce the release of DTDB 1.0, a Dependency Treebank
DataBase. The database covers treebanks for 11 languages, which have been
transformed into a single representation format. This format is an XML-based
graph model, and it was designed to support the interoperability of existing
corpora.
The wiki http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/ presents
the treebanks and the unification format used. Details about the format are
also described in:
http://ariadne.coli.uni-bielefeld.de/pustylnikov/pdfs/acl07.1.0.pdf
My question is: do other treebanks exist which are not part of the database?
If you know of an existing treebank that should be transformed into the
unified format, please let me know.
--
Olga Pustylnikov
Universität Bielefeld
Fakultät für Linguistik und Literaturwissenschaft
Universitätsstraße 25
D-33615 Bielefeld
http://ariadne.coli.uni-bielefeld.de/pustylnikov/
olga.pustylnikov at uni-bielefeld.de