[Corpora-List] Talbanken05 (Swedish Treebank)
Joakim Nivre
nivre at msi.vxu.se
Wed Nov 23 07:54:28 UTC 2005
We are happy to announce the release of Talbanken05 (Version 1.0).
Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of
roughly 300,000 words, constructed at Lund University in the 1970s.
The treebank comes with no guarantee but is freely available for research
and educational purposes as long as proper credit is given for the work
done to produce the material (both in Lund and in Växjö).
The treebank consists of a written language part and a spoken language
part of roughly equal size. The written language part in turn consists of
two sections, the so-called professional prose section (P), with data
from textbooks, brochures, newspapers, etc., and a collection of high
school students' essays (G). The spoken language part also has two
sections, interviews (IB) and conversations and debates (SD). Altogether,
the corpus contains close to 300,000 running tokens.
The distribution contains the entire treebank (divided into sections
P, G, IB and SD) in four versions:
MAMBA: Original syntactic and lexical annotation (and encoding)
FPS: Flat phrase structure annotation (TIGER-XML encoding)
DPS: Deepened phrase structure annotation (TIGER-XML encoding)
Dep: Dependency structure annotation (Malt-XML encoding)
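As a rough illustration of how the Dep version might be read, here is a minimal Python sketch. It assumes a Malt-XML-style layout in which each sentence element contains word elements carrying id, form, head and deprel attributes. The element names, attribute names, sample sentence and labels are assumptions made for illustration, not taken from the distribution, so check them against the actual files before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in an assumed Malt-XML-style layout.
# Element names, attribute names, and labels are illustrative only.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="Hon" postag="PO" head="2" deprel="SS"/>
    <word id="2" form="sover" postag="VV" head="0" deprel="ROOT"/>
  </sentence>
</treebank>"""


def read_dependencies(xml_text):
    """Return one list of (id, form, head, deprel) tuples per sentence."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("sentence"):
        words = [
            (int(w.get("id")), w.get("form"), int(w.get("head")), w.get("deprel"))
            for w in sentence.iter("word")
        ]
        sentences.append(words)
    return sentences


if __name__ == "__main__":
    for words in read_dependencies(SAMPLE):
        for word_id, form, head, deprel in words:
            # head 0 is assumed here to mark the root of the sentence
            print(word_id, form, head, deprel)
```

For a real file, the same function could be given the file's contents as a string, provided the layout matches the assumptions above.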
The treebank can be downloaded from:
http://www.msi.vxu.se/users/nivre/research/Talbanken05.html
Joakim Nivre
Jens Nilsson
Johan Hall
Växjö University
School of Mathematics and Systems Engineering