Corpora: Summary on available syntactically parsed corpora

Rene.Valdes at lhsl.com
Thu Aug 16 16:40:11 UTC 2001


Dear list members,

As requested by some of the respondents, I'd like to summarize the
responses I got to my inquiry on available syntactically parsed (treebank)
corpora for English, French, German, and other languages.  As reflected
below, there are a few good options for English and German, as well as
Chinese.  However, I received no replies concerning French and could not
locate any such corpus myself.  Since we are about to embark on a project that
would benefit from the availability of such a corpus, I'd really appreciate
any information about French treebanks of any size and style.  And now on
to the summary:

1. ICE-GB corpus (British English)

The ICE-GB corpus is a one-million-word corpus of British English, fully
parsed for clause and phrase structure. For more info see:
http://www.ucl.ac.uk/english-usage/ice-gb/index.htm

Reply from:    Dr Gerald Nelson,
          Research Assistant Professor,
          Department of English,
          The University of Hong Kong,
          Pokfulam Road,
          Hong Kong SAR.

          Email: ganelson at hkucc.hku.hk
          Phone: (852) 2241-5141
          Fax: (852) 2559-7139
          http://www.hku.hk/english/staff/ganelson.htm

2.  TIGER project (German)

In the TIGER project we are creating a large syntactically annotated
corpus of German newspaper text. A corpus sampler will be released this
month:
http://www.ims.uni-stuttgart.de/projekte/TIGER/

My task is to develop a search tool for syntactically annotated corpora
- a first beta version will be released in October, the final version in
November.

Reply from:    Wolfgang Lezius                 lezius at ims.uni-stuttgart.de
          IMS, University of Stuttgart    Tel.: +49 +711 121-1374
          Azenbergstr. 12                 Fax:  +49 +711 121-1366
          D-70174 Stuttgart
          Germany

3.  NEGRA corpus (German)

The German ``NEGRA Corpus'' consists of parsed newspaper texts.
See http://www.coli.uni-sb.de/sfb378/negra-corpus/

Reply from:    Thorsten Brants
          brants at parc.xerox.com

4.  Verbmobil treebanks (German, English, Japanese)

We could help you with treebanks for English and German (and to some
degree for Japanese). They were developed in Tuebingen in the framework
of Verbmobil, a speech-to-speech translation project. For this reason,
the treebanks contain spontaneous speech data in three domains: scheduling
of business appointments, travel scheduling, and hotel reservations.

The English treebank contains ca. 30,000 sentences, the German treebank
ca. 38,000 sentences. The Japanese treebank is somewhat smaller,
containing ca. 18,000 sentences. The annotations for all treebanks cover
the levels of morpho-syntax, syntactic phrase structure, and
function-argument structure. The annotation schemes are purely
context-free, i.e. they do not contain crossing branches or traces.
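
As a rough illustration of what "purely context-free" means in practice,
the sketch below reads a bracketed tree and lists the word span of each
constituent.  Without crossing branches, every constituent covers one
contiguous stretch of words.  The bracket notation, the category labels,
and the example sentence are assumptions made for illustration only; they
are not taken from the Verbmobil annotation scheme or its stylebooks.

```python
def parse_bracketed(text):
    """Parse a bracketed tree, e.g. "(S (NP we) (VP meet))", into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def constituent_spans(tree, start=0):
    """Return (end, spans); each span is (label, first_word, last_word_exclusive)."""
    label, children = tree
    spans = []
    end = start
    for child in children:
        if isinstance(child, str):
            end += 1                           # one word consumed
        else:
            end, child_spans = constituent_spans(child, end)
            spans.extend(child_spans)
    spans.append((label, start, end))          # contiguous by construction
    return end, spans


# Hypothetical example sentence with made-up labels.
EXAMPLE = "(S (NP (PRP we)) (VP (MD could) (VP (VB meet) (PP (IN on) (NP (NNP Monday))))))"
n_words, spans = constituent_spans(parse_bracketed(EXAMPLE))
for label, first, last in spans:
    print(label, first, last)
```

A scheme with crossing branches or traces could not be represented this
way, since some constituents would then cover non-adjacent words.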

Additionally, for each treebank, there exists an extensive stylebook,
which describes how different phenomena are annotated.

As the treebanks are only becoming available now (due to project
restrictions), I am not sure what the license conditions for commercial
use will be.

Reply from:    Sandra Kuebler
          University of Tuebingen
          Computational Linguistics
          Wilhelmstr. 113
          D-72074 Tuebingen
          Germany
          phone: +49-7071-2978490
          fax: +49-7071-551335
          email: kuebler at sfs.nphil.uni-tuebingen.de
          URL: http://www.sfs.nphil.uni-tuebingen.de/~kuebler/

5.  BLLIP99 corpora

Are you aware of the BLLIP99 corpora distributed by LDC?  30 million
words of WSJ text, machine parsed and coreferenced.

Reply from:    Eugene Charniak
          ec at bohr.cs.brown.edu

6.  Various links to check

In case no one answers, you may want to check the list archives at:
http://www.hit.uib.no/corpora/

Also, the largest collection of corpora I know of is from the Linguistic
Data Consortium:
http://www.ldc.upenn.edu/

Chris Manning also has an extensive list of links to corpus resources
http://www-nlp.stanford.edu/links/statnlp.html#Corpora

Reply from:    Daniel Walker
          Mendez, Inc.
          dwalker at lhsl.com

7.  Chinese Penn Treebank

This one is also available from LDC and contains about 100K words (4185
sentences from 325 articles from Xinhua newswire between 1994 and 1998).
It was parsed following the general methodology of the Penn Treebank.  It
costs $100.
See http://www.ldc.upenn.edu/Catalog/LDC2000T48.html

(I obtained this information by looking through the LDC catalog.)

Again, any information on syntactically parsed French corpora would be
greatly appreciated.


René J. Valdés
Mendez, Inc.
San Diego, California
USA
http://www.mendez.com
rvaldes at lhsl.com
1-858-737-5216

