[Corpora-List] Announcement: Release of the DependencyTreebankDatabase DTDB 1.0

tufis at racai.ro tufis at racai.ro
Wed Feb 6 16:59:57 UTC 2008

Dear Olga,

At http://corp.hum.sdu.dk Eckhard Bick created a great grammatically
annotated Romanian corpus, as well as similarly annotated corpora for
various other languages.

The Romanian corpus covers the business language domain and has a size  
of 21.4 million words (27 million tokens). It was compiled by Arina  
Greavu (arinagreavu at yahoo.com) from news text sources, and annotated  
with (a) PoS and morphology using our tagger, as well as (b) syntactic  
function and shallow dependency markers using a Constraint Grammar  
system at VISL (http://beta.visl.sdu.dk/constraint_grammar.html). You  
might get further information from Eckhard (eckhard.bick at mail.dk).

Best regards,



From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf  
Of Olga Pustylnikov
Sent: 5 February 2008 10:49
To: corpora at uib.no
Subject: Re: [Corpora-List] Announcement: Release of the  
DependencyTreebankDatabase DTDB 1.0

Dear Sabine,

thank you for your response and for your helpful hints. Our main goals
behind the conversion into eGXL were:
1. to achieve unification by means of XML
2. to integrate the treebanks into a corpus management system where
not only treebanks but also spoken and web corpora are
stored/retrieved by means of GXL (an XML-based graph representation)

Restricting yourself to one specific format always entails additional
adaptations of your application when you have to deal with a new
treebank. Thus, we tried to select a format which is generic enough to
be reused and which is suitable for treebanks. GXL is a generic graph
model which allows one to represent any kind of corpus, since any sort
of relation can be expressed in terms of a graph. That makes GXL a
useful means for corpus retrieval. Treebanks can easily be mapped to
it (since trees are special cases of graphs). eGXL slightly modifies
GXL in order to account for specifics of treebanks. Thus, we selected
this format because it meets both requirements: it is generic and it is
suitable for treebanks.
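
To illustrate the point that a dependency tree is just a special case of a
graph, here is a minimal sketch in Python. It writes a tiny dependency tree
as a plain GXL-style document with gxl, graph, node, edge and attr elements.
This is not the eGXL format used in DTDB, whose treebank-specific
modifications are not described in this message. The function name
tree_to_gxl, the attribute names "form" and "deprel", the "w" node-id prefix
and the example sentence are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def tree_to_gxl(tokens, graph_id="sentence1"):
    """Build a GXL-style element tree from (id, form, head, deprel) tuples.

    A head of 0 marks the root token, which gets no incoming edge.
    The attribute names used here are illustrative, not taken from eGXL.
    """
    gxl = ET.Element("gxl")
    graph = ET.SubElement(gxl, "graph", id=graph_id, edgemode="directed")

    # One node per token, carrying the word form as a string attribute.
    for tok_id, form, _head, _rel in tokens:
        node = ET.SubElement(graph, "node", id=f"w{tok_id}")
        attr = ET.SubElement(node, "attr", name="form")
        ET.SubElement(attr, "string").text = form

    # One directed edge per dependency, pointing from head to dependent.
    for tok_id, _form, head, rel in tokens:
        if head == 0:
            continue
        edge = ET.SubElement(graph, "edge", {"from": f"w{head}", "to": f"w{tok_id}"})
        attr = ET.SubElement(edge, "attr", name="deprel")
        ET.SubElement(attr, "string").text = rel

    return gxl


# Invented example sentence, not taken from any treebank.
EXAMPLE = [(1, "She", 2, "subj"), (2, "reads", 0, "root"), (3, "books", 2, "obj")]
gxl_root = tree_to_gxl(EXAMPLE)
print(ET.tostring(gxl_root, encoding="unicode"))
```

Because nodes and edges are generic, the same structure could in principle
hold non-tree relations as well, which is the genericity argument made above.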

In my paper I don't provide a detailed comparison of eGXL to other
formats. However, CoNLL is referred to in the comparison of the treebanks,
although only indirectly. Please send me a reference to your work,
which I failed to mention in this paper, and I will consider it in
my future work.

Best regards,

On Feb 3, 2008 1:39 PM, Sabine Buchholz  
<sabine.buchholz at crl.toshiba.co.uk> wrote:

Dear Olga,
I think uniform formats for treebanks are a good idea and therefore read
your announcement, Wiki page and article with interest. However, that raised
a lot of questions:
You clearly are aware of the CoNLL-X shared task on multilingual dependency
parsing, as you link to its home page from your Wiki. For that task 13
treebanks were converted to a uniform format, many of them among the 11 you
list. Our goal was probably different from yours but
1) Why is that work not even mentioned in the paper, let alone compared to?
2) What part of the analyses you did for the paper could you not have done
using the CoNLL-X format?
You even seem to have used the CoNLL-X version of some treebanks (e.g.
Dutch) as the basis of your eGXL conversion (the Dutch example in your paper
is in CoNLL-X and not the original Alpino format).
3) Why did you choose to do that? The conversion from Alpino to CoNLL-X
format loses some information, so why not convert from the original format?
Same potentially for Swedish and Bulgarian.
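
For readers unfamiliar with the CoNLL-X format discussed here: it stores one
token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG,
POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and separates sentences by a
blank line. The following is a minimal reading sketch in Python, not the
shared task's own tooling. The names Token and parse_conllx and the sample
sentence are invented for illustration, and the sketch ignores the PHEAD and
PDEPREL columns.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # 0 means the token is the root of the sentence
    deprel: str


def parse_conllx(text):
    """Split CoNLL-X text into sentences, each a list of Token tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  cols[4], cols[5], int(cols[6]), cols[7])
        )
    if current:
        sentences.append(current)
    return sentences


# Invented sample, not taken from any treebank.
SAMPLE = (
    "1\tShe\tshe\tPRON\tPRON\t_\t2\tsubj\t_\t_\n"
    "2\treads\tread\tV\tV\t_\t0\tROOT\t_\t_\n"
    "3\tbooks\tbook\tN\tN\t_\t2\tobj\t_\t_\n"
)
sentences = parse_conllx(SAMPLE)
for tok in sentences[0]:
    print(tok.id, tok.form, tok.head, tok.deprel)
```

Each token's HEAD column already encodes an edge to its governor, so a
sentence read this way is a labelled tree over its tokens.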

With regard to your question about other treebanks to add to your database:
in addition to the remainder of the 13 CoNLL-X treebanks and the new ones
converted for the successor (the CoNLL 2007 shared task on dependency
parsing), http://en.wikipedia.org/wiki/Treebank lists even more treebanks.
But you probably already know that, you link to it from your Wiki...
Although I just noticed that the Romanian treebank you used is still missing
from that list...

Looking forward to hearing from you,
kind regards,
Sabine Buchholz

----- Original Message -----
From: Olga Pustylnikov
To: corpora at uib.no
Sent: Friday, February 01, 2008 9:31 AM
Subject: [Corpora-List] Announcement: Release of the Dependency
TreebankDatabase DTDB 1.0

Dear list members,
I'm happy to announce the release of DTDB 1.0, a Dependency Treebank
DataBase. The database consists of 11 languages which are transformed into a
single representation format. This format is an XML based graph model, and
it was designed to support the interoperability of existing corpora.
The wiki http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/ presents
the treebanks and the unification format used. Details about the format are
also described in:
My question is: do other treebanks exist which are not part of the database?
If you know of an existing treebank that should be transformed into the
unified format, please let me know.

Olga Pustylnikov

Universität Bielefeld
Fakultät für Linguistik und Literaturwissenschaft
Universitätsstraße 25
D-33615 Bielefeld

olga.pustylnikov at uni-bielefeld.de
