[Corpora-List] German treebanks - new releases
Heike Zinsmeister
heike.zinsmeister at uni-tuebingen.de
Mon Nov 14 08:45:34 UTC 2005
The Division of Computational Linguistics at the Seminar fuer Sprachwissenschaft
of the University of Tuebingen (Germany) is happy to announce the release of
two German language resources:
* The Tuebingen Treebank of Spoken German (TueBa-D/S)
* The Tuebingen Treebank of Written German (TueBa-D/Z) - second release
Both treebanks share the same basic annotation scheme, which
distinguishes four levels of syntactic constituency: the lexical level,
the phrasal level, the level of topological fields, and the clausal level.
In addition to constituent structure, the annotated trees carry edge labels
between nodes that encode grammatical functions.
Both treebanks are available in three formats:
* NEGRA export format
* XML format
* Penn Treebank format
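
For readers who have not worked with bracketed trees before, the following
is a minimal Python sketch of how a tree in a Penn Treebank style bracketing
could be read into nested tuples. It is not part of the release. The example
tree is invented for illustration, and the convention of attaching the
grammatical function to the category with a hyphen (as in NX-ON) is an
assumption here; please check the documentation shipped with the treebanks
for the actual encoding.

```python
# Minimal reader for bracketed (Penn Treebank style) trees.
# Assumptions (not taken from the treebank documentation):
#   - one tree per string, with no empty-label wrapper around it
#   - an edge label, if present, follows the category after a hyphen


def tokenize(text):
    """Split a bracketed string into '(', ')' and bare symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    A child is either another (label, children) tuple or a word string.
    """
    tokens = tokenize(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in surface order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def split_label(label):
    """Split 'CAT-FUNC' into (category, function); function may be None."""
    category, _, function = label.partition("-")
    return category, function or None


# Hypothetical example tree, invented for illustration only.
EXAMPLE = (
    "(SIMPX (VF (NX-ON (PPER Wir))) "
    "(LK (VXFIN-HD (VVFIN treffen))) "
    "(MF (NX-OA (PRF uns))))"
)

tree = parse(EXAMPLE)
print(leaves(tree))
print(split_label("NX-ON"))
```

The nested tuples keep the clausal, topological field, phrasal, and lexical
levels as successive layers of the same tree, so each level can be inspected
by walking down from the root.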
The treebanks in detail:
1. The Tuebingen Treebank of Spoken German (TueBa-D/S)
The TueBa-D/S treebank was annotated as part of the Verbmobil project,
a long-term Machine Translation project for spontaneous speech funded
by the German Ministry for Education, Science, Research, and
Technology (BMBF). This is the first public release of the treebank.
TueBa-D/S is a syntactically annotated corpus based on spontaneous
dialogues, which were manually transliterated. The treebank comprises
approximately 38 000 sentences (ca. 360 000 words). The syntactic
annotation was also performed manually.
The license for TueBa-D/S is granted free of charge for scientific use.
For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuebads.shtml
2. The Tuebingen Treebank of Written German (TueBa-D/Z) - second release
The TueBa-D/Z treebank is a manually annotated German newspaper
corpus based on data taken from daily issues of the newspaper
'die tageszeitung'.
It currently comprises approximately 22 000 sentences (ca. 380 000 words).
The annotation scheme is an extended version of the TueBa-D/S annotation
scheme. It accounts for a larger number of linguistic phenomena and is
enriched at two levels: (multi-word) named entities are marked at the
phrasal level, and words are annotated with inflectional morphology at
the lexical level (currently ca. 70% of the sentences are covered).
What is new in the second release:
- about 6 800 additional sentences
- morphological information
- cleaner versions of the trees published in the first release
The license for TueBa-D/Z is granted free of charge for scientific use.
For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml
With best regards,
Erhard W. Hinrichs
Sandra Kübler
Heike Zinsmeister
-------------------------------------------------------
For your information:
A related resource is the Tuebingen Partially Parsed Corpus of
Written German (TuePP-D/Z), released 12/2003.
TuePP-D/Z is a 200 million word collection of articles from the taz
newspaper, which have been automatically annotated with clause structure,
topological fields, and chunks, in addition to lower-level annotation
including parts of speech and morphological ambiguity classes.
For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuepp.shtml