16.3291, FYI: German Treebanks; Corpus of Written Italian

Wed Nov 16 00:41:41 UTC 2005

LINGUIST List: Vol-16-3291. Tue Nov 15 2005. ISSN: 1068 - 4875.

Subject: 16.3291, FYI: German Treebanks; Corpus of Written Italian

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Dooley, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Svetlana Aksenova <svetlana at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 14-Nov-2005
From: Heike Zinsmeister < heike.zinsmeister at uni-tuebingen.de >
Subject: German Treebanks 

2)
Date: 14-Nov-2005
From: Pier Marco Bertinetto < Bertinetto at sns.it >
Subject: Corpus and Frequency Lexicon of Written Italian 

-------------------------Message 1 ---------------------------------- 
Date: Tue, 15 Nov 2005 19:36:19
From: Heike Zinsmeister < heike.zinsmeister at uni-tuebingen.de >
Subject: German Treebanks 

The Division of Computational Linguistics at the Seminar fuer
Sprachwissenschaft of the University of Tuebingen (Germany) is happy to
announce the release of two German language resources:

* The Tuebingen Treebank of Spoken German (TueBa-D/S)
* The Tuebingen Treebank of Written German (TueBa-D/Z), second release

Both treebanks share the same basic annotation scheme, which
distinguishes four levels of syntactic constituency: the lexical level, the
phrasal level, the level of topological fields, and the clausal level. In
addition to constituent structure, the annotated trees carry edge labels
between nodes that encode grammatical functions.

Both treebanks are available in three formats:
   * NEGRA export format
   * XML format
   * Penn Treebank format
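
As a rough illustration of the annotation described above, the following
Python sketch reads one bracketed tree in a Penn-Treebank-like notation and
keeps the edge label (grammatical function) separately from the node
category. The node labels (CLAUSE, FIELD, NP, N, V), the function labels
(SUBJ, HEAD), and the "category:function" notation are assumptions made up
for this example; they are not taken from the TueBa-D/S or TueBa-D/Z
documentation, which should be consulted for the actual label inventory and
file layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Node:
    category: str                    # node label (hypothetical inventory)
    function: Optional[str] = None   # edge label, i.e. grammatical function
    children: List[Union["Node", str]] = field(default_factory=list)


def parse_bracketed(text: str) -> Node:
    """Parse a single bracketed tree such as '(NP:SUBJ (N Peter))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Assumed notation: 'CATEGORY:FUNCTION', with the function optional.
        category, _, function = tokens[pos].partition(":")
        pos += 1
        node = Node(category, function or None)
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(parse_node())
            else:
                node.children.append(tokens[pos])  # a word at the lexical level
                pos += 1
        pos += 1  # consume the closing bracket
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def words(node: Node) -> List[str]:
    """Return the words of the tree in surface order."""
    out: List[str] = []
    for child in node.children:
        out.extend(words(child) if isinstance(child, Node) else [child])
    return out


# Toy tree with one node per level: clause > field > phrase > lexical item.
EXAMPLE = "(CLAUSE (FIELD (NP:SUBJ (N Peter))) (FIELD (V:HEAD schlaeft)))"

tree = parse_bracketed(EXAMPLE)
subject = tree.children[0].children[0]
print(tree.category, subject.category, subject.function, words(tree))
```

The recursive reader mirrors the nesting of the four constituency levels;
storing the function on the node is only one possible design, chosen here
because each node has exactly one incoming edge.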

The treebanks in detail:

1. The Tuebingen Treebank of Spoken German (TueBa-D/S)

The TueBa-D/S treebank was annotated in the Verbmobil project, a long-term
Machine Translation project for spontaneous speech funded by the German
Ministry for Education, Science, Research, and Technology (BMBF). This is
the first public release of the treebank.

TueBa-D/S is a syntactically annotated corpus based on spontaneous
dialogues, which were manually transliterated. The treebank comprises
approximately 38 000 sentences (ca. 360 000 words). The syntactic
annotation was also performed manually.

The license for TueBa-D/S is granted free of charge for scientific use. For
more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuebads.shtml

2. The Tuebingen Treebank of Written German (TueBa-D/Z) - second release

The TueBa-D/Z treebank is a manually annotated German newspaper corpus
based on data taken from daily issues of the newspaper 'die tageszeitung'
(taz). It currently comprises approximately 22 000 sentences (ca. 380 000
words).

The annotation scheme is an extended version of the TueBa-D/S annotation
scheme. It accounts for a larger number of linguistic phenomena and is
enriched at two levels: (multi-word) named entities are marked at the
phrasal level; words are annotated with inflectional morphology at the
lexical level (currently ca. 70% of the sentences are covered).

What is new in the second release:

- about 6 800 additional sentences
- morphological information
- cleaner versions of the trees published in the first release

The license for TueBa-D/Z is granted free of charge for scientific use. For
more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml

With best regards,

Erhard W. Hinrichs
Sandra Kübler
Heike Zinsmeister 
-------------------------------------------------------

For your information:

A related resource is the Tuebingen Partially Parsed Corpus of
Written German (TuePP-D/Z), released in December 2003.

TuePP-D/Z is a 200 million word collection of articles from the taz
newspaper which have been automatically annotated with clause structure,
topological fields, and chunks, in addition to lower-level annotation
including parts of speech and morphological ambiguity classes.

For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuepp.shtml 

Linguistic Field(s): Computational Linguistics
                     Syntax
                     Text/Corpus Linguistics

-------------------------Message 2 ---------------------------------- 
Date: Tue, 15 Nov 2005 19:36:23
From: Pier Marco Bertinetto < Bertinetto at sns.it >
Subject: Corpus and Frequency Lexicon of Written Italian 

We are glad to announce a new lexical resource:

CoLFIS (Corpus e Lessico di Frequenza dell'Italiano Scritto)
[Corpus and Frequency Lexicon of Written Italian]

produced by

Pier Marco Bertinetto°, Cristina Burani*, Alessandro Laudanna^*,
Lucia Marconi+, Daniela Ratti+, Claudia Rolando+, Anna Maria Thornton§

° Scuola Normale Superiore, Pisa
* Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma
^ Università di Salerno
+ Istituto di Linguistica Computazionale, Unità Staccata di Genova, CNR, Genova
§ Università de L'Aquila

The reference corpus consists of excerpts from newspapers, magazines and
books. It includes 3,150,075 lexical occurrences. The corpus was designed
to approximate as closely as possible what Italians typically read, as
reflected in official statistics.

The lexicon consists of two main components: the repertoire of forms and
the repertoire of lemmas. In the latter, identical forms that belong to
different lemmas are disambiguated, while syntagmatic words (such as
"table's leg") are treated as single entries.

The lexical lists (both forms and lemmas) are presently available for free
download at:

http://alphalinguistica.sns.it/BancheDati.htm
http://www.istc.cnr.it/material/database/colfis/

They are available in several arrangements: by frequency rank, in
inverse alphabetical order, with or without the distinction between
capital and non-capital letters, etc. The entire corpus is not yet
available. We hope to put it online as soon as we obtain the necessary
authorizations.
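
To make the list arrangements mentioned above more concrete, here is a small
Python sketch that builds form and lemma counts from a handful of
(form, lemma) pairs and orders them in two ways. The toy data, the function
names, and the lemma assignments are invented for illustration and do not
reflect the CoLFIS file format or its lemmatization conventions. In
particular, "inverse alphabetical" is read here as sorting by the reversed
spelling of each form (word endings first); that reading is an assumption.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical (form, lemma) pairs; "porta" occurs once as a noun and once
# as a verb form, so the two occurrences count towards different lemmas.
TOKENS: List[Tuple[str, str]] = [
    ("La", "il"), ("porta", "porta"), ("era", "essere"), ("aperta", "aperto"),
    ("Maria", "Maria"), ("porta", "portare"), ("la", "il"), ("borsa", "borsa"),
]


def form_counts(tokens, fold_case: bool) -> Counter:
    """Count forms, optionally ignoring the capital / non-capital distinction."""
    return Counter(f.lower() if fold_case else f for f, _ in tokens)


def lemma_counts(tokens) -> Counter:
    """Count lemmas; homographic forms are kept apart by their lemma."""
    return Counter(lemma for _, lemma in tokens)


def by_frequency_rank(counts: Counter) -> List[Tuple[str, int]]:
    """Most frequent first; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def by_reversed_spelling(counts: Counter) -> List[str]:
    """Order entries by their spelling read from right to left."""
    return sorted(counts, key=lambda entry: entry[::-1])


folded = form_counts(TOKENS, fold_case=True)
cased = form_counts(TOKENS, fold_case=False)
lemmas = lemma_counts(TOKENS)

print(by_frequency_rank(folded))
print(by_reversed_spelling(folded))
print(lemmas["porta"], lemmas["portare"])
```

With case folding, "La" and "la" fall together into one entry of frequency
2; without it they remain two entries of frequency 1 each.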

The work has been produced with CNR (Consiglio Nazionale delle Ricerche)
support.

With the help of interested users, we hope to enrich this resource with
further features.

Linguistic Field(s): Text/Corpus Linguistics
