Corpora: Fast Transformation-Based Toolkit
Radu Florian
rflorian at cs.jhu.edu
Thu Oct 11 21:58:41 UTC 2001
The fnTBL Toolkit
-----------------
The Natural Language Processing Group from Johns Hopkins University is
happy to announce the availability of fnTBL 1.0, a fast implementation
of Transformation-Based Learning (TBL).
Transformation-based learning is an error-driven machine learning
technique which functions by first assigning the most likely class to
samples, and then iteratively selecting and applying the transformation
rule which results in the maximum reduction of the error rate.
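The loop described above can be illustrated with a minimal sketch in
Python. This is not the fnTBL implementation and does not use its fast
search: it is the naive greedy procedure, with a single hypothetical rule
template ("change label OLD to NEW when the previous label is PREV").
All function names and the toy data are assumptions made for
illustration only.

```python
from collections import Counter, defaultdict


def initial_labels(train_words, train_tags, words):
    """Initial state: give each word its most frequent training tag."""
    counts = defaultdict(Counter)
    for word, tag in zip(train_words, train_tags):
        counts[word][tag] += 1
    # Unseen words fall back to the overall most frequent tag.
    default = Counter(train_tags).most_common(1)[0][0]
    return [counts[w].most_common(1)[0][0] if w in counts else default
            for w in words]


def apply_rule(rule, labels):
    """Apply (prev, old, new): relabel old as new when preceded by prev.

    Conditions are checked against the labels as they were before this
    rule was applied, so all changes happen simultaneously.
    """
    prev, old, new = rule
    out = list(labels)
    for i in range(1, len(labels)):
        if labels[i] == old and labels[i - 1] == prev:
            out[i] = new
    return out


def errors(labels, gold):
    """Number of positions where the current label differs from the truth."""
    return sum(1 for a, b in zip(labels, gold) if a != b)


def learn(words, gold, min_gain=1):
    """Greedily pick the rule with the largest error reduction, repeat."""
    labels = initial_labels(words, gold, words)
    tagset = sorted(set(gold))
    rules = []
    while True:
        base = errors(labels, gold)
        best_rule, best_gain = None, 0
        # Exhaustively score every instantiation of the one template.
        for prev in tagset:
            for old in tagset:
                for new in tagset:
                    if old == new:
                        continue
                    candidate = (prev, old, new)
                    gain = base - errors(apply_rule(candidate, labels), gold)
                    if gain > best_gain:
                        best_rule, best_gain = candidate, gain
        if best_rule is None or best_gain < min_gain:
            break  # no remaining rule reduces the error enough
        labels = apply_rule(best_rule, labels)
        rules.append(best_rule)
    return rules, labels


# Toy data: "race" is mostly a verb, but a noun after a determiner.
words = ["to", "race", "the", "race", "to", "race", "to", "race"]
gold = ["TO", "VB", "DT", "NN", "TO", "VB", "TO", "VB"]
rules, labels = learn(words, gold)
print(rules, errors(labels, gold))
```

On this toy data the initial labeling tags every "race" as VB, and the
sketch learns the single rule ('DT', 'VB', 'NN'), which removes the one
remaining error. Each pass rescans the whole corpus for every candidate
rule; this is the cost that a fast implementation has to avoid.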
The fnTBL toolkit is designed for large, dynamic classification tasks
like the ones that are common in Natural Language Processing, such as
part-of-speech tagging, base noun phrase chunking or word sense
disambiguation, but can be used to perform any classification task
with symbolic features. fnTBL dramatically improves the running time
compared with the original TBL algorithm proposed by Eric Brill,
obtaining a speed-up of up to two orders of magnitude while maintaining
the same performance.
Some of the features of the fnTBL toolkit:
- it supports a large number of symbolic features and feature types
(including bag-of-words-type features, identity features, subword
features, prefix/suffix features, etc.);
- it has a flexible architecture, with feature types being easy to
create, add, remove or modify, which makes the toolkit useful in
rapidly deploying a classifier for a particular task;
- new tasks are easy to set up: a large pool of feature types is
already implemented and some Perl tools for data processing are provided;
- basic NLP tasks for English (part-of-speech tagging, base noun
phrase and text chunking) are already trained and are part of the
distribution; others (e.g. Swedish part-of-speech tagging) can be
downloaded from the web site;
- multitask, simultaneous classification is supported (e.g. learn to
perform word segmentation together with POS tagging for Chinese);
- the resulting rules often carry easy-to-understand linguistic content,
which can offer insight into the problem's behavior.
=========
Download
=========
fnTBL version 1.0 is public domain software and can be downloaded from
the main web site:
http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html
When downloading the software, you will be invited to join the fnTBL
mailing list, at fnTBLtk at nlp.cs.jhu.edu.
For more information about fnTBL, please refer to the documentation
at:
http://nlp.cs.jhu.edu/~rflorian/fntbl/documentation.html
The documentation can also be downloaded separately as a PostScript or
PDF file from:
http://nlp.cs.jhu.edu/~rflorian/fntbl/fnTBL-toolkit.ps.gz
or
http://nlp.cs.jhu.edu/~rflorian/fntbl/fnTBL-toolkit.pdf.gz
The software package contains the C++ sources of the program and a
number of useful Perl scripts, including an almost turn-key solution
for training and/or testing a POS tagger. A small number of test cases
and three pre-trained rule systems (English POS tagging, English base
NP chunking and English text chunking) are also provided. The software
is easy to set up on most Unix systems; it has also been tested on a
Windows (Cygwin) system.
We hope that the fnTBL toolkit will prove useful to you,
Radu Florian and Grace Ngai
Natural Language Processing Group
Johns Hopkins University