[Corpora-List] Natural Language Toolkit (NLTK) Version 0.9 has been released

Steven Bird sb at csse.unimelb.edu.au
Wed Oct 17 00:20:10 UTC 2007


NLTK-Lite version 0.9 has been released -- http://nltk.org/index.php

NLTK -- the Natural Language Toolkit -- is a suite of open source
Python modules, data and documentation for research and development in
natural language processing. NLTK contains code supporting dozens of
NLP tasks, along with 30 popular corpora and extensive documentation
including a 360-page online book.  The toolkit has been used in 50+
university courses in over 15 countries, and is in the top 0.1% of SourceForge
projects (32,000 downloads in the past 12 months).

Contents: NLTK consists of over 50,000 lines of Python code and 480 MB of data:

Corpora: Treebanks (English, Chinese, Dutch, Catalan, Spanish, Portuguese);
    POS-tagged corpora including the Brown Corpus; text corpora;
    PP attachment, named entity, WSD, TIMIT sample,
    Chat-80 database, WordNet, CMU Pronunciation Dictionary.
Tokenizers: whitespace, newline, blankline, word, wordpunct,
    treebank, regexp, Punkt sentence segmenter
Stemmers: Porter, Lancaster, regexp
Taggers: regexp, n-gram, backoff, Brill, HMM
Parsers: recursive descent, shift-reduce, chunk, chart,
    feature-based, probabilistic, ...
Semantic interpretation: untyped lambda calculus,
    first-order models, parser interface
WordNet: WordNet interface, lexical relations, similarity
Classifiers: decision tree, maximum entropy, naive Bayes, Weka interface
Clusterers: expectation maximization, agglomerative, k-means
Evaluation: accuracy, precision, recall, F-measure, windowdiff
Estimation: uniform, maximum likelihood, Lidstone, Laplace,
    expected likelihood, heldout, cross-validation, Good-Turing, Witten-Bell
Miscellaneous: feature detection, unification, chatbots, many utilities
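
Several of the estimation methods listed above belong to one family.
The following is a minimal pure-Python sketch of that relationship, not
NLTK's own code: Lidstone estimation adds a constant gamma to every
count, and maximum likelihood (gamma = 0), expected likelihood
(gamma = 0.5) and Laplace (gamma = 1) are special cases. The function
name lidstone_prob, the toy sentence and the choice of six bins are
assumptions made for illustration only.

```python
from collections import Counter


def lidstone_prob(counts, word, gamma, bins):
    """Lidstone estimate: (count + gamma) / (N + gamma * bins)."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * bins)


tokens = "the cat sat on the mat".split()
counts = Counter(tokens)  # a Counter returns 0 for unseen words

# Hypothetical vocabulary size: the 5 observed word types plus 1 unseen type.
bins = len(counts) + 1

mle = lidstone_prob(counts, "the", 0.0, bins)      # maximum likelihood
laplace = lidstone_prob(counts, "the", 1.0, bins)  # add-one smoothing
ele = lidstone_prob(counts, "dog", 0.5, bins)      # unseen word, add-half

print(mle, laplace, ele)
```

With gamma = 0 an unseen word such as "dog" would get probability zero;
any positive gamma reserves some probability mass for it.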

Changes: Version 0.9 is substantially revised and expanded from version 0.8.
The entire toolkit can now be accessed via a single import statement,
"import nltk", and uses a more convenient naming scheme. Calling
deprecated functions generates messages that help programmers update
their code. The corpus, tagger, and classifier modules have been
redesigned. All functionality of the old NLTK 1.4.3 is now covered by
NLTK-Lite 0.9. The book has been revised and expanded. A new data
package incorporates the existing corpus collection and contains new
sections for pre-specified grammars and pre-computed models. Several
new corpora have been added, including treebanks for Portuguese,
Spanish, Catalan and Dutch. A Macintosh distribution is provided.  For
full details of the changes, please see:
http://nltk.svn.sourceforge.net/viewvc/*checkout*/nltk/trunk/nltk/ChangeLog
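
As an illustration of how calls to deprecated functions can produce
messages that point programmers to a replacement, here is a minimal
sketch built on Python's standard warnings module. It is not NLTK's
actual mechanism; the decorator name deprecated and the functions
old_tokenize and split_on_whitespace are hypothetical.

```python
import functools
import warnings


def deprecated(replacement):
    """Hypothetical decorator: warn on each call and name the replacement."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__}() is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorate


def split_on_whitespace(text):
    """The new, preferred function."""
    return text.split()


@deprecated("split_on_whitespace()")
def old_tokenize(text):
    """The old name, kept so existing code continues to run."""
    return split_on_whitespace(text)


# Record warnings so the message can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_tokenize("a b c")

print(result)
print(caught[0].message)
```

The old function still returns its result, so existing programs keep
working while the message tells the programmer what to change.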



