Prague Dependency Treebank

Daniel Zeman zeman at ufal.ms.mff.cuni.cz
Thu Oct 15 17:48:14 UTC 1998


The Institute of Formal and Applied Linguistics (UFAL) at Charles
University, Prague, proudly announces that the first version of the
PRAGUE DEPENDENCY TREEBANK has been made available to the research
community.

The Prague Dependency Treebank (PDT) is a morphologically and
syntactically annotated corpus of Czech, chosen as a representative of
inflectionally rich free-word-order languages. (For example, all the
Slavic languages, such as Russian, Polish, Serbo-Croatian and many
others, which are spoken together by more than 350 million people, have
typological properties similar to those of Czech in both morphology and
syntax.) The current version of PDT (0.5) contains 456705 tokens (words
and punctuation) in 26610 sentences and 576 files. To keep the results
of NLP applications comparable, the data has been divided into a
training set (19126 sentences), a development test set (3697
sentences) and a (cross-)evaluation test set (3787 sentences).
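
The split sizes quoted above can be checked with a few lines of Python.
This is only a sketch: the variable names are chosen for illustration,
and the numbers are the ones given in this announcement, not read from
the PDT files themselves.

```python
# Sentence counts quoted in the announcement for PDT 0.5.
# The dictionary keys are illustrative names, not PDT file or directory names.
splits = {
    "training": 19126,
    "development test": 3697,
    "evaluation test": 3787,
}
total_sentences = 26610
total_tokens = 456705

# The three sets should account for every sentence in the corpus.
assert sum(splits.values()) == total_sentences

# Share of each set, as a percentage of all sentences.
for name, count in splits.items():
    share = 100.0 * count / total_sentences
    print(f"{name}: {count} sentences ({share:.1f}%)")

# Average sentence length in tokens (words and punctuation).
mean_length = total_tokens / total_sentences
print(f"mean tokens per sentence: {mean_length:.1f}")
```

The three sets add up to the stated total, giving a split of roughly
72%, 14% and 14%, with about 17 tokens per sentence on average.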

The Prague Dependency Treebank is - to a certain extent - modelled after
the Penn Treebank but it uses the dependency syntax representation of
sentences. It has three layers:

  1. morphological (uses word forms, tags, lemmas)
  2. analytical, or surface syntax (uses dependencies and analytical
     functions of dependencies)
  3. tectogrammatical, which captures linguistic meaning (contains
     tectogrammatical functions such as Actor, Patient, Addressee, etc.)
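
One way to picture how the three layers attach to a single token is
sketched below. This is not the PDT file format: the class name, the
field names, the example sentence and all label values except "Actor"
are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    # Layer 1 (morphological): word form, lemma and tag.
    form: str
    lemma: str
    tag: str
    # Layer 2 (analytical): position of the governing token in the
    # sentence (None for the root) and the function of the dependency.
    head: Optional[int] = None
    function: Optional[str] = None
    # Layer 3 (tectogrammatical): a function such as "Actor" or "Patient".
    tecto_function: Optional[str] = None


def dependents(sentence: List[Token], head: int) -> List[int]:
    """Return the positions of all tokens governed by the token at `head`."""
    return [i for i, token in enumerate(sentence) if token.head == head]


# Hypothetical annotation of the Czech sentence "Petr spí." ("Petr sleeps.").
# Tags and function labels are placeholders, not the PDT tag sets.
sentence = [
    Token("Petr", "Petr", "noun", head=1, function="subject",
          tecto_function="Actor"),
    Token("spí", "spát", "verb", head=None, function="predicate"),
    Token(".", ".", "punctuation", head=1, function="punctuation"),
]

for position in dependents(sentence, 1):
    print(sentence[position].form, "depends on", sentence[1].form)
```

The point of the sketch is only that each layer adds information to the
one below it: the dependency layer refers to tokens already described
at the morphological layer, and the tectogrammatical layer labels
relations already present in the dependency structure.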

The Prague Dependency Treebank is a long-term project which should end
in the year 2000. At the moment (October 1998) we have roughly half the
material at our disposal (at levels 1 and 2), while level 3 is still in
the specification phase and the rules of transition between the
representations on level 2 and level 3 are being formulated. The current
version is thus preliminary and identified as "PDT version 0.5"
(reflecting mostly the amount of material currently available).

The text material contains samples from the following sources:

  1. Lidove noviny (daily newspaper), 1991, 1994, 1995
  2. Mlada fronta Dnes (daily newspaper), 1992
  3. Ceskomoravsky Profit (business weekly), 1994
  4. Vesmir (scientific magazine), Academia Publishers, 1992, 1993

The electronic source has been provided by the Institute of the Czech
National Corpus, in a format jointly developed by the ICNK and UFAL.

The Treebank has been supported by the following grants and projects:

    Grant Agency of the Czech Republic No. 405/96/0198 
       (Treebank Definition and Procedures Specification)
    Grant Agency of the Czech Republic No. 405/96/K214 
       (Tools and Level 1 Annotation)
    Ministry of Education of the Czech Republic Project No. VS96151
       (Tools and Structural Annotation on the Level 2)
    National Science Foundation grant No. IIS-9732388
       (Version 0.5 Preparation for the Workshop 98)

The documentation of PDT is linked from its main page at UFAL. Go to the
UFAL home page, http://ufal.ms.mff.cuni.cz/, then click on "Projects"
and "Treebank".

PDT Version 0.5 is freely available for research purposes, provided
that you fill in and submit a licence agreement. The appropriate form is
also linked from the PDT web page.



-- 
Daniel Zeman, UFAL MFF UK, Praha
zeman at ufal.mff.cuni.cz
http://www.ms.mff.cuni.cz/~zeman/


