[Corpora-List] 'imperfect' corpora

Eric Ringger ringger at cs.byu.edu
Thu Nov 16 22:52:31 UTC 2006


Thanks to all for the interesting references.

As a Ph.D. student, I conducted some related research on the post-correction
of speech recognition results.  Here is the briefest noteworthy reference:

Eric K. Ringger and James F. Allen. "A Fertility Channel Model for
Post-Correction of Continuous Speech Recognition." Proceedings of the Fourth
International Conference on Spoken Language Processing (ICSLP'96).
Philadelphia, PA. October 1996.

http://www.cs.rochester.edu/u/ringger/research/icslp-96.html

As no automatic post-correction technique will itself be perfect, I agree
with Sravana Reddy that there is much to be said for corpus analysis
techniques that are robust to the errors which inevitably occur in the
process of automatic document acquisition (OCR, speech recognition, ...).

Many of the automatic post-correction techniques referenced in this thread
leverage common error instances and types.  One would expect robust corpus
analysis techniques at least to be able to see through the infrequent,
random errors.
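
As a minimal sketch of what "leveraging common error instances and types" can look like in practice, the Python below computes an edit distance in which substitutions known to be frequent recognition confusions cost less than arbitrary ones, and then picks the closest lexicon entry.  It is an illustration only, not the method of any paper cited in this thread: the confusion pairs, their weights, and the function names are all made-up assumptions.

```python
# Hypothetical confusion costs: substitutions a recognizer makes often are
# cheap. These pairs and weights are illustrative assumptions, not measured.
CONFUSION_COST = {
    ("1", "l"): 0.2,
    ("l", "1"): 0.2,
    ("0", "O"): 0.2,
    ("O", "0"): 0.2,
    ("c", "e"): 0.4,
    ("e", "c"): 0.4,
}


def substitution_cost(a, b):
    """Cost of reading character b where a was observed."""
    if a == b:
        return 0.0
    # Unknown confusions fall back to the standard unit cost.
    return CONFUSION_COST.get((a, b), 1.0)


def weighted_edit_distance(observed, candidate):
    """Levenshtein distance with unit insert/delete and weighted substitution."""
    rows, cols = len(observed) + 1, len(candidate) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every observed character
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every candidate character
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1]
                + substitution_cost(observed[i - 1], candidate[j - 1]),
            )
    return dist[-1][-1]


def best_correction(observed, lexicon):
    """Return the lexicon entry closest to the observed token."""
    return min(lexicon, key=lambda word: weighted_edit_distance(observed, word))
```

For example, best_correction("c1ean", ["clean", "ocean", "clear"]) selects "clean", because the "1" for "l" confusion is cheap under the assumed weights.  Infrequent, random errors have no entry in such a table, which is why the analysis itself still needs to be robust to them.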

Regards,
--Eric
http://faculty.cs.byu.edu/~ringger/ 

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Mirko Tavosanis
Sent: Thursday, November 16, 2006 10:25 AM
To: Hunter, Duncan; corpora at lists.uib.no
Subject: Re: [Corpora-List] 'imperfect' corpora

Hi, Duncan,

as for OCR problems, you can probably use:

1. Christoph Ringlstetter, Klaus U. Schulz and 
Stoyan Mihov: Orthographic Errors in Web Pages - 
Towards Cleaner Web Corpora. Computational Linguistics 32(3): 295-340.

2. Strohmaier, Christian, Christoph Ringlstetter,
Klaus U. Schulz, and Stoyan Mihov. 2003a.
Lexical postcorrection of OCR-results: The
web as a dynamic secondary dictionary?
In Proceedings of the Seventh International
Conference on Document Analysis and
Recognition (ICDAR 03), pages 1133-1137,
Edinburgh.

3. Strohmaier, Christian, Christoph Ringlstetter,
Klaus U. Schulz, and Stoyan Mihov.
A visual and interactive tool for
optimizing lexical postcorrection of
OCR results. In Proceedings of the IEEE
Workshop on Document Image Analysis
and Recognition, DIAR'03, Madison, WI.

4. Ringlstetter, Christoph. 2003. OCR-Korrektur
und Bestimmung von
Levenshtein-Gewichten. Master's
thesis, LMU, University of Munich.



Mirko Tavosanis
Dipartimento di Studi italianistici
Universita' di Pisa
http://www.humnet.unipi.it/ital/ 


