[Corpora-List] Natural Language Toolkit: NLTK-Lite version 0.6.5 released
Hamish Cunningham
hamish at dcs.shef.ac.uk
Tue Jul 11 10:41:26 UTC 2006
Markus,
You might try the Unicode-based tokeniser included with GATE
(http://gate.ac.uk), or ask on the user list for a German specialisation of
it.
Best
--
Hamish
http://www.dcs.shef.ac.uk/~hamish/
Markus Heller wrote:
> Dear Corpora Community,
>
> I recently saw that the tokenizer from the nltk package requires a good regex.
> Does anybody have a reasonable regex for this package which can produce
> decent tokens from modern texts, preferably German texts? I have tried out
> the ones on the tutorial pages, but it seems that an ordinary user of the
> package is expected to develop their own regex for tokenizing. Are there
> good (free) tokenizer regexes around for this package?
>
> Thanks in advance,
> Markus
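
For illustration, here is a minimal sketch of the kind of regex asked about
above. It is not taken from NLTK-Lite or from GATE; it uses only Python's
standard re module, and the names TOKEN_PATTERN and tokenize are made up for
this example. It assumes Python 3, where \w already matches umlauts and ß. It
keeps dotted abbreviations like "z.B.", numbers with German decimal commas, and
hyphenated compounds together, and splits off every other punctuation mark.

```python
import re

# Hypothetical pattern for illustration only; alternatives are tried in order.
TOKEN_PATTERN = re.compile(r"""
    (?:[^\W\d_]\.){2,}       # dotted abbreviations made of single letters: z.B., d.h.
  | \d+(?:[.,]\d+)*          # numbers and dates: 3,50  1.000.000  11.07.2006
  | \w+(?:[-']\w+)*          # words, including hyphenated compounds: E-Mail
  | [^\w\s]                  # any other single non-space character (punctuation)
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text, scanning left to right."""
    # All groups are non-capturing, so findall returns the full matches.
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Die E-Mail kostet z.B. 3,50 Euro, oder?"))
    # ['Die', 'E-Mail', 'kostet', 'z.B.', '3,50', 'Euro', ',', 'oder', '?']
```

Multi-letter abbreviations such as "Dr." or "usw." are still split into the
word and a full stop; handling those would probably need an explicit
abbreviation list, which this sketch leaves out. The same pattern should in
principle be usable with a regex-driven tokenizer, but that has not been
checked against the NLTK-Lite release discussed in this thread.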