[Corpora-List] Tools for historical languages?

Thu Nov 20 15:45:16 UTC 2008

Dear Stefanie,

We have been working for the past few years on VARD 2, a tool for
normalising spelling in historical corpora (particularly Early Modern
English). The tool can be used to standardise individual texts or an entire
corpus, either manually or automatically. It replaces variants with modern
equivalents, using XML tags to retain the original spelling. The tool also
learns which replacement methods are most effective, so training it on a
relatively small sample will improve standardisation of a particular corpus.
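
To illustrate the replace-and-tag idea, here is a minimal Python sketch. It
is not VARD 2's code: the variant list, the `normalise` function, and the
`<normalised orig="...">` tag format are all assumptions made for
illustration, and a simple lookup table stands in for the dictionaries and
learned replacement methods described above.

```python
import re
from xml.sax.saxutils import quoteattr

# Hypothetical variant-to-modern mapping (illustrative only).
VARIANTS = {"loue": "love", "haue": "have", "vpon": "upon"}


def normalise(text, variants=VARIANTS):
    """Replace known variants with modern forms, keeping the original
    spelling in an XML attribute (tag name and attribute are assumptions)."""

    def repl(match):
        word = match.group(0)
        modern = variants.get(word.lower())
        if modern is None:
            return word  # not a known variant: leave untouched
        if word[0].isupper():
            modern = modern.capitalize()  # preserve initial capitalisation
        return f"<normalised orig={quoteattr(word)}>{modern}</normalised>"

    # Only alphabetic tokens are considered; punctuation passes through.
    return re.sub(r"[A-Za-z]+", repl, text)


print(normalise("I haue loue"))
```

Because the original spelling is kept in the attribute, the normalisation
can be reversed or inspected later without going back to the source text.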

The tool was developed for Early Modern English; however, by plugging in
other dictionaries and through training, it can be used with other
languages and varieties.

Further details of our research are available at
http://ucrel.lancs.ac.uk/VariantSpelling/. The tool itself is available
free of charge for academic use from:
http://www.comp.lancs.ac.uk/~barona/vard2/. Further details and a user guide
are also available.

We've also recently completed studies investigating the effect of spelling
variation on corpus linguistic techniques:

For keyword analysis: Baron, A., Rayson, P. and Archer, D. (forthcoming).
Word frequency and key word statistics in historical corpus linguistics.
International Journal of English Studies.

And for part-of-speech tagging: Rayson, P., Archer, D., Baron, A., Culpeper,
J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a
modern POS tagger on Early Modern English corpora. In Davies, M., Rayson,
P., Hunston, S. and Danielsson, P. (eds.) Proceedings of the Corpus
Linguistics Conference: CL2007, University of Birmingham, UK, 27-30 July
2007.

Both studies quantify the effect of spelling variation on corpus linguistic
studies. The former paper also quantifies the levels of spelling variation
in various Early Modern English corpora including Early English Books
Online.

Please get in touch if you require more details.

Regards,
Alistair Baron

________________________
Alistair Baron
C28, Computing Department,
Infolab 21, South Drive,
Lancaster University,
Lancaster,
LA1 4WA

T: +44(0) 15245 10348
E: a.baron at comp.lancs.ac.uk
-----Original Message-----
From: Stefanie Dipper <dipper at linguistics.rub.de>
Date: 2008/11/19
Subject: [Corpora-List] Tools for historical languages?
To: CORPORA at uib.no

Dear all,

I'm looking for tools for the analysis of historical languages, e.g.
sentence splitters, part-of-speech taggers, or spelling normalisers. I am
working on German texts (diplomatic transcriptions) from the 11th-16th
centuries, but I'd be interested in tools for any historical language, and
tools for languages that lack a standardised spelling such as dialects.

Thank you for any help,
Stefanie

--
Jun.-Prof. Dr. Stefanie Dipper
Sprachwiss. Institut, Ruhr-Universitaet Bochum
D - 44780 Bochum, Germany
http://www.linguistics.ruhr-uni-bochum.de/~dipper

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora