[Corpora-List] New Release of the TüBa-D/Z German Treebank
Marie Hinrichs
marie.hinrichs at uni-tuebingen.de
Thu Dec 12 10:56:49 UTC 2013
The Department of Linguistics of the University of Tübingen (Germany) is
pleased to announce the new release of a referentially and syntactically
annotated German corpus:
* The Tübingen Treebank of Written German (TüBa-D/Z) - 9th release
The TüBa-D/Z treebank is a manually annotated German newspaper corpus
based on data taken from daily issues of the newspaper 'die tageszeitung'. It
currently comprises 85,358 sentences (1,569,916 words; 3,444 newspaper
articles).
The syntactic annotation scheme of the TüBa-D/Z distinguishes four
levels of syntactic constituency: the lexical level, the phrasal level,
the level of topological fields, and the clausal level.
The treebank has been enriched with anaphoric and coreference relations
referring to nominal and pronominal antecedents. Linking relations
include: coreferential (two NPs refer to the same extra-linguistic
referent), anaphoric/cataphoric (a definite pronoun refers to a
contextual antecedent) and other relations (split-antecedent, instance)
as well as marking of expletive pronouns.
(Complex) named entities are classified as organisation, person,
location, geo-political entity, and other.
For selected discourse connectives, the instances occurring in the
treebank have been annotated with the discourse relation(s) conveyed by
the connective instance. Portions of the treebank have been
sense-annotated for the connectives 'nachdem' (298 instances), 'während'
(531 instances), 'sobald' (28 instances), 'seitdem' (13 instances),
'als' (169 instances), 'aber' (161 instances), and 'bevor' (119 instances).
Another annotation layer contains structural information as well as
implicit discourse relations for a subcorpus of 41 annotated newspaper
articles (21,817 tokens) with 1,458 (explicit and implicit) discourse
relations.
The annotation comprises information on
* Inflectional morphology
* Lemmas
* Syntactic constituency
* Grammatical functions
* (Complex) named entities incl. semantic classification
* Anaphora and coreference relations
* Discourse connectives (partial coverage)
* Dependency relations (automatically created)
* Chunk annotation (automatically created)
The treebank is available in these formats:
* NEGRA export format
* XML format (TigerXML, exportXML, and TCF)
* Penn Treebank format
* CoNLL format
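As a rough illustration of working with the CoNLL release, the sketch below
reads tab-separated, blank-line-delimited sentences into Python dictionaries.
It assumes the generic ten-column CoNLL-X layout; the exact columns used in
the TüBa-D/Z files should be checked against the release documentation. The
function name read_conll and the sample sentences are hypothetical and are
not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# Generic CoNLL-X column names. That the TüBa-D/Z CoNLL files use exactly
# these ten columns is an ASSUMPTION made for this sketch.
CONLL_X_COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLL_X_COLUMNS, fields)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented sample input (NOT treebank data); unused columns are left as "_".
SAMPLE = (
    "1\tDas\tdas\t_\t_\t_\t2\t_\t_\t_\n"
    "2\tist\tsein\t_\t_\t_\t0\t_\t_\t_\n"
    "3\tgut\tgut\t_\t_\t_\t2\t_\t_\t_\n"
    "\n"
    "1\tJa\tja\t_\t_\t_\t0\t_\t_\t_\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
for tokens in sentences:
    print(" ".join(token["form"] for token in tokens))
```

For a real file, the same function can be applied to an open file handle,
for example read_conll(open(path, encoding="utf-8")), where path is whatever
location the downloaded release was unpacked to.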
The license for TüBa-D/Z is granted free of charge for scientific use.
For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
With best regards,
Erhard W. Hinrichs
Kathrin Beck
Heike Telljohann
Marie Hinrichs
------------
Dept. of Computational Linguistics
University of Tübingen
Wilhelmstr. 19
72074 Tübingen
Germany