[Corpora-List] Tools for historical languages?
Martin Reynaert
reynaert at uvt.nl
Wed Nov 19 17:40:35 UTC 2008
Dear Stefanie,
I am working on a tool to perform spelling normalization for large
corpora - contemporary or historical - in the framework of a project for
the National Library in the Netherlands.
The tool is called TICCL (pronounced 'tickle'), short for Text-Induced
Corpus Clean-up. The prototype was described in:
Non-interactive OCR post-correction for giga-scale digitization projects
Author(s): Martin Reynaert
Reference: In A. Gelbukh (Ed.), Proceedings of the Computational
Linguistics and Intelligent Text Processing 9th International
Conference, CICLing 2008. Lecture Notes in Computer Science Vol.
4919/2008, Berlin / Heidelberg: Springer, pp. 617-630.
The focus there was on OCR-misrecognition errors, but TICCL handles any
kind of spelling variation. It is largely language-independent, but
assumes an alphabet.
A production-grade version should become available as free software
sometime early next year. I intend to announce that event on this list.
Greetings,
Martin Reynaert
ILK (Induction of Linguistic Knowledge)
TiCC (Tilburg centre for Creative Computing)
University of Tilburg
http://ilk.uvt.nl
Stefanie Dipper wrote:
> Dear all,
>
> I'm looking for tools for the analysis of historical languages, e.g.
> sentence splitters, part-of-speech taggers, or spelling normalisers. I am
> working on German texts (diplomatic transcriptions) from the 11th-16th
> centuries, but I'd be interested in tools for any historical language, and
> tools for languages that lack a standardised spelling such as dialects.
>
> Thank you for any help,
> Stefanie
>
> --
> Jun.-Prof. Dr. Stefanie Dipper
> Sprachwiss. Institut, Ruhr-Universitaet Bochum
> D - 44780 Bochum, Germany
> http://www.linguistics.ruhr-uni-bochum.de/~dipper
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>