[Corpora-List] New suite of linguistically-motivated NLP tools available

Fri Jun 8 19:45:46 UTC 2007

Version 1.0 of the C&C language processing tools is now freely
available for research use.

A key feature of the tools is that they combine robust, efficient,
wide-coverage language processing with detailed linguistic output. We
have used the tools to analyse the entire Gigaword corpus (1 billion
words) in only 5 days using 18 processors. This combination of speed,
robustness and wide coverage with a high level of linguistic detail
represents a breakthrough in NLP technology.
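
To put the Gigaword figures in perspective, the short calculation below
converts them into words per second. It is only a back-of-the-envelope
sketch: it treats "1 billion words" and "5 days" as exact, and it assumes
the work was split evenly across the 18 processors, which the announcement
does not state.

```python
# Figures quoted in the announcement, treated here as exact.
total_words = 1_000_000_000   # Gigaword corpus, about 1 billion words
days = 5
processors = 18

seconds = days * 24 * 60 * 60                      # wall-clock seconds in 5 days
aggregate_rate = total_words / seconds             # words/second over all processors
per_processor_rate = aggregate_rate / processors   # assumes an even split (assumption)

print(f"aggregate:     {aggregate_rate:.0f} words/s")
print(f"per processor: {per_processor_rate:.0f} words/s")
```

Under these assumptions the throughput comes to roughly 2,300 words per
second overall, or about 130 words per second per processor.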

The tools comprise:

* Wide-coverage parser based on Combinatory Categorial Grammar. The
   parser recovers labelled *predicate-argument dependencies* from
   CCGbank at over 85% F-score, and from Depbank at over 81% F-score.

* The semantics tool Boxer, which takes the output of the C&C parser
   and taggers and produces Discourse Representation Structures.

* A number of Maximum Entropy taggers: POS tagger, chunker, CCG
   supertagger, and Named Entity Recognizer.
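
For readers unfamiliar with the F-score figures quoted for the parser,
the sketch below shows how such a score can be computed over sets of
labelled dependencies. It assumes the balanced F-score (the harmonic mean
of precision and recall), which the announcement does not spell out. The
function name f_score, the (head, label, dependent) triple format and the
toy sentence are hypothetical illustrations, not the representation used
by CCGbank, Depbank or the C&C evaluation scripts.

```python
def f_score(gold, predicted):
    """Return (precision, recall, F1) for two sets of labelled dependencies."""
    correct = len(gold & predicted)   # dependencies matching on every field
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy example for "John saw the dog":
# each dependency is a (head, label, dependent) triple.
gold = {("saw", "subj", "John"), ("saw", "obj", "dog"), ("dog", "det", "the")}
predicted = {("saw", "subj", "John"), ("saw", "obj", "dog"), ("saw", "det", "the")}

precision, recall, f1 = f_score(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

Because a dependency counts as correct only if its head, label and
dependent all match, the misattached determiner in the toy example costs
both precision and recall.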

The tools are written in C++ (except Boxer, which is written in Prolog)
and are designed for large-scale NLP tasks that require sophisticated
linguistic processing.

The parser and taggers have been developed by James R. Curran and
Stephen Clark. Boxer has been developed by Johan Bos.

The tools, including source code, can be downloaded from:
http://svn.ask.it.usyd.edu.au/trac/candc/wiki

--
Stephen Clark
Lecturer in Computer Science
Oxford University Computing Laboratory
Fellow of Keble College
Member of the Oxford Computational Linguistics Group

http://www.comlab.ox.ac.uk/stephen.clark/