24.5094, FYI: New Release of the TüBa-D/Z German Treebank

linguist at linguistlist.org
Thu Dec 12 15:00:41 UTC 2013

LINGUIST List: Vol-24-5094. Thu Dec 12 2013. ISSN: 1069 - 4875.

Subject: 24.5094, FYI: New Release of the TüBa-D/Z German Treebank

Moderator: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>

Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Mateja Schuck, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

Editor for this issue: Uliana Kazagasheva <uliana at linguistlist.org>

Date: Thu, 12 Dec 2013 10:00:24
From: Marie Hinrichs [marie.hinrichs at uni-tuebingen.de]
Subject: New Release of the TüBa-D/Z German Treebank

The Department of Linguistics of the University of Tübingen (Germany) is pleased to announce the new release of a referentially and syntactically annotated German corpus:

The Tübingen Treebank of Written German (TüBa-D/Z) - 9th release

The TüBa-D/Z treebank is a manually annotated German newspaper corpus based on data taken from daily issues of the newspaper 'die tageszeitung'. It currently comprises 85,358 sentences (1,569,916 words; 3,444 newspaper articles).

The syntactic annotation scheme of the TüBa-D/Z distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the clausal level.

The treebank has been enriched with anaphoric and coreference relations referring to nominal and pronominal antecedents. Linking relations include: coreferential (two NPs refer to the same extra-linguistic referent), anaphoric/cataphoric (a definite pronoun refers to a contextual antecedent) and other relations (split-antecedent, instance) as well as marking of expletive pronouns.
(Complex) named entities are classified as organisation, person, location, geo-political entity, and other.

For selected discourse connectives, the instances occurring in the treebank have been annotated with the discourse relation(s) conveyed by the connective instance. Portions of the treebank have been sense-annotated for the connectives 'nachdem' (298 instances), 'während' (531 instances), 'sobald' (28 instances), 'seitdem' (13 instances), 'als' (169 instances), 'aber' (161 instances), and 'bevor' (119 instances).

Another annotation layer contains structural information as well as implicit discourse relations for a subcorpus of 41 annotated newspaper articles (21,817 tokens) with 1,458 (explicit and implicit) discourse relations.

The annotation comprises information on:
- Inflectional morphology
- Lemmas 
- Syntactic constituency 
- Grammatical functions 
- (Complex) named entities incl. semantic classification 
- Anaphora and coreference relations
- Discourse connectives (partial coverage)
- Dependency relations (automatically created)
- Chunk annotation (automatically created) 

The treebank is available in these formats: 
- NEGRA export format 
- XML format (TigerXML, exportXML, and TCF)
- Penn Treebank format
- CoNLL format
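
For readers who plan to work with the CoNLL version, the following is a minimal Python sketch of reading such a file sentence by sentence. It assumes the generic CoNLL-X layout (ten tab-separated columns, one token per line, a blank line between sentences); the column inventory actually used in the TüBa-D/Z distribution may differ and should be checked against the documentation shipped with the release. The function name read_conll, the column names, and the two sample sentences (including their tags and dependency labels) are made up for illustration and are not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the generic CoNLL-X layout. This is an assumption;
# the TüBa-D/Z release may use a different column inventory.
CONLL_X_COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLL_X_COLUMNS, fields)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented toy input, NOT taken from the treebank.
sample = (
    "1\tDas\tder\tART\tART\t_\t2\tDET\t_\t_\n"
    "2\tHaus\tHaus\tN\tNN\t_\t3\tSUBJ\t_\t_\n"
    "3\tsteht\tstehen\tV\tVVFIN\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tSie\tsie\tPRO\tPPER\t_\t2\tSUBJ\t_\t_\n"
    "2\tliest\tlesen\tV\tVVFIN\t_\t0\tROOT\t_\t_\n"
)

sentences = list(read_conll(sample.splitlines()))
for tokens in sentences:
    print(" ".join(token["form"] for token in tokens))
```

To read from a file instead of the sample string, an open file handle can be passed to read_conll directly, since the function only iterates over lines.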

The license for TüBa-D/Z is granted free of charge for scientific use.
For more information, please refer to: 

With best regards, 

Erhard W. Hinrichs
Kathrin Beck
Heike Telljohann
Marie Hinrichs

Dept. of Computational Linguistics
University of Tübingen
Wilhelmstr. 19
72074 Tübingen

Linguistic Field(s): Computational Linguistics
                     Discourse Analysis
                     Text/Corpus Linguistics

Subject Language(s): German (deu)

