[Corpora-List] Tools for historical languages?

Martin Reynaert reynaert at uvt.nl
Wed Nov 19 17:40:35 UTC 2008

Dear Stefanie,

I am working on a tool to perform spelling normalization for large 
corpora - contemporary or historical - as part of a project for the 
National Library of the Netherlands.

The tool is called TICCL (pronounced 'tickle'), short for Text-Induced 
Corpus Clean-up. The prototype was described in:

Non-interactive OCR post-correction for giga-scale digitization projects

       Author(s): Martin Reynaert
       Reference: In A. Gelbukh (Ed.), Proceedings of the Computational
                  Linguistics and Intelligent Text Processing 9th
                  International Conference, CICLing 2008. Lecture Notes
                  in Computer Science Vol. 4919/2008, Berlin /
                  Heidelberg: Springer, pp. 617-630.

The focus there was on OCR-misrecognition errors, but TICCL handles any 
kind of spelling variation. It is largely language-independent, but 
assumes an alphabet.

A production-grade version should become available as free software 
early next year. I intend to announce the release on this list.


Martin Reynaert
ILK (Induction of Linguistic Knowledge)
TiCC (Tilburg centre for Creative Computing)
University of Tilburg


Stefanie Dipper wrote:
> Dear all,
> I'm looking for tools for the analysis of historical languages, e.g. 
> sentence splitters, part-of-speech taggers, or spelling normalisers. I am 
> working on German texts (diplomatic transcriptions) from the 11th-16th 
> centuries, but I'd be interested in tools for any historical language, and 
> tools for languages that lack a standardised spelling such as dialects.
> Thank you for any help,
> Stefanie
> --
> Jun.-Prof. Dr. Stefanie Dipper
> Sprachwiss. Institut, Ruhr-Universitaet Bochum
> D - 44780 Bochum, Germany
> http://www.linguistics.ruhr-uni-bochum.de/~dipper
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
