[Corpora-List] Talbanken05 (Swedish Treebank)

Joakim Nivre nivre at msi.vxu.se
Wed Nov 23 07:54:28 UTC 2005


We are happy to announce the release of Talbanken05 (Version 1.0).

Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of
roughly 300,000 words, constructed at Lund University in the 1970s.
The treebank comes with no guarantee but is freely available for research
and educational purposes as long as proper credit is given for the work
done to produce the material (both in Lund and in Växjö).

The treebank consists of a written language part and a spoken language
part of roughly equal size. The written language part in turn consists of
two sections, the so-called professional prose section (P), with data
from textbooks, brochures, newspapers, etc., and a collection of high
school students' essays (G). The spoken language part also has two
sections, interviews (IB) and conversations and debates (SD). Altogether,
the corpus contains close to 300,000 running tokens.

The distribution contains the entire treebank (divided into sections
P, G, IB and SD) in four versions:

  MAMBA: Original syntactic and lexical annotation (and encoding)
  FPS:   Flat phrase structure annotation (TIGER-XML encoding)
  DPS:   Deepened phrase structure annotation (TIGER-XML encoding)
  Dep:   Dependency structure annotation (Malt-XML encoding)
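
As a rough illustration of working with the TIGER-XML versions (FPS and DPS), the Python sketch below counts sentences and terminal tokens. It assumes the general TIGER-XML layout, in which each sentence is an `s` element whose `terminals` element holds one `t` element per token. The sample document, its words and its labels are made up for illustration and are not taken from Talbanken05, and `count_tokens` is a hypothetical helper name. The attribute names in the actual files should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document in the general TIGER-XML shape.
# The words and labels are placeholders, not data from Talbanken05.
SAMPLE = """<corpus id="sample">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Hon" pos="PO"/>
          <t id="s1_2" word="sover" pos="VV"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SS" idref="s1_1"/>
            <edge label="FV" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def count_tokens(root):
    """Return (sentence count, terminal token count) for a parsed tree."""
    # Assumption: sentences are <s> elements and tokens are <t> elements.
    sentences = root.findall(".//s")
    tokens = sum(len(s.findall(".//t")) for s in sentences)
    return len(sentences), tokens


root = ET.fromstring(SAMPLE)
n_sentences, n_tokens = count_tokens(root)
print(n_sentences, n_tokens)
```

For a file on disk, `ET.parse(path).getroot()` can be passed to `count_tokens` in place of the in-memory sample.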

The treebank can be downloaded from:
http://www.msi.vxu.se/users/nivre/research/Talbanken05.html

Joakim Nivre
Jens Nilsson
Johan Hall

Växjö University
School of Mathematics and Systems Engineering


