[Corpora-List] New release: jTokeniser 1.2
Andy Roberts
andyr at comp.leeds.ac.uk
Thu Aug 4 21:51:15 UTC 2005
Hi all,
Since I recall someone recently asking about sentence segmentation
software, I thought I'd give jTokeniser a quick plug.
I've just released jTokeniser 1.2. jTokeniser is an open-source Java
library that provides a simple framework for a variety of tokenisers.
There are currently six at your disposal:
* WhiteSpaceTokeniser - this splits a string on all occurrences of
whitespace, which includes spaces, tabs and newlines.
* StringTokeniser - this is basically the same as Java's
java.util.StringTokenizer with some extra methods (and it extends
Tokeniser). By default it acts as a WhiteSpaceTokeniser, but you can
specify the set of characters to be treated as word delimiters.
* RegexTokeniser - this tokeniser is much more flexible as you can use
regular expressions to define what a token is. So, "\\w+" means
whenever it matches one or more word characters (letters, digits or
underscores), it will consider that a word. By default, it uses a
regular expression equivalent to a whitespace tokeniser.
* RegexSeparatorTokeniser - this can be thought of as an advanced
StringTokeniser. Whereas StringTokeniser is limited to defining
delimiters as a set of individual characters, RegexSeparatorTokeniser
can utilise regular expressions for a richer and more flexible
approach.
* BreakIteratorTokeniser - one of the most sophisticated of the lot,
although it should only be used on natural language strings to isolate
words. It comes with built-in rules for finding words, such as knowing
how to disregard punctuation.
* SentenceTokeniser - this also uses a BreakIterator like the above,
but tuned towards finding sentence boundaries. The "tokens" in this
tokeniser are in fact individual sentences.
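For readers who want a feel for the techniques behind the six
tokenisers listed above, here is a rough sketch that uses only standard
JDK classes. It is not jTokeniser's own API: the class name
TokeniserSketch and all of its method names are invented for
illustration, and it assumes that the BreakIterator mentioned above is
java.text.BreakIterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from jTokeniser.
class TokeniserSketch {

    // Whitespace tokenising: split on runs of whitespace characters.
    public static List<String> onWhitespace(String text) {
        return separatedBy(text, "\\s+");
    }

    // Delimiter-set tokenising: every character in 'delimiters' ends a token.
    public static List<String> onDelimiters(String text, String delimiters) {
        StringTokenizer st = new StringTokenizer(text, delimiters);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Regex tokenising: the pattern describes what a token looks like.
    public static List<String> matching(String text, String tokenRegex) {
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Regex separator tokenising: the pattern describes what lies between tokens.
    public static List<String> separatedBy(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split(separatorRegex)) {
            if (!t.isEmpty()) {          // drop the empty piece before a leading separator
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Word tokenising with BreakIterator, skipping punctuation and spaces.
    public static List<String> words(String text) {
        return segments(BreakIterator.getWordInstance(Locale.ENGLISH), text, true);
    }

    // Sentence tokenising with BreakIterator: each token is a whole sentence.
    public static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.ENGLISH), text, false);
    }

    // Walk consecutive boundaries and collect the text between them.
    private static List<String> segments(BreakIterator it, String text, boolean wordsOnly) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.isEmpty()) {
                continue;                // segment was only whitespace
            }
            if (wordsOnly && !Character.isLetterOrDigit(piece.codePointAt(0))) {
                continue;                // segment was punctuation
            }
            out.add(piece);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, world! Is this a test? Yes.";
        System.out.println(onWhitespace(text));
        System.out.println(onDelimiters(text, " ,!?."));
        System.out.println(matching(text, "\\w+"));
        System.out.println(separatedBy(text, "[\\s\\p{Punct}]+"));
        System.out.println(words(text));
        System.out.println(sentences(text));
    }
}
```

The difference between matching() and separatedBy() mirrors the
RegexTokeniser / RegexSeparatorTokeniser distinction: the first
describes the tokens, the second describes the gaps between them.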
Now, this is just a library at the moment, so you need to be a Java
programmer to use these tokenisers. Fortunately, they all follow the
same simple framework, which the docs and sample code should make
clearer. I do intend to create a GUI front-end in the future, so that
the tokenisers can be used as a stand-alone application by people who
are not Java programmers.
Full information is available at the jTokeniser homepage:
http://www.comp.leeds.ac.uk/andyr/software/jTokeniser/
Suggestions, comments and complaints welcome. :)
Regards,
Andy Roberts
--
Computer Vision and Language Research Group
School of Computing
University of Leeds,
Leeds, UK, LS2 9JT
http://www.comp.leeds.ac.uk/andyr