[Corpora-List] SCANNED TEXTS ARE VALID FOR CORPORA PURPOSES?

Mon Aug 4 09:11:05 UTC 2008

The method Dan outlines is fine for its usual application. When you are
searching through thousands of pages of old newspapers for references to
some politician or some historically significant phrase, for example, it
really doesn't matter if you miss a few because the OCR was less than
brilliant. The theoretical problem I have with this approach is the
simple observation that OCR errors are not random: some letters are
harder to distinguish than others, and some misreadings are more
frequent than others. This matters if you are trying to build a sound
statistical model of token frequencies in the original source. To make
matters even more confusing, these facts are well known to the
manufacturers of good OCR systems, so their software incorporates
statistical procedures to "enhance" the results according to
expectations derived from standardised corpora. That may be bad news
for those who are planning to produce such models...
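
To make the frequency point concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration: the
confusion table, its probabilities, the toy corpus and the function
names (simulate_ocr, compare_counts) are invented, not measured from
any real OCR system. It only shows how systematic misreadings shift
counts in one direction instead of averaging out.

```python
import random
from collections import Counter

# Hypothetical confusion table: source string -> (misreading, probability).
# The entries and probabilities are invented for illustration only.
CONFUSIONS = {
    "rn": ("m", 0.30),
    "cl": ("d", 0.20),
    "e": ("c", 0.05),
}


def simulate_ocr(token, rng):
    """Return the token as a simulated OCR engine might read it."""
    out = []
    i = 0
    while i < len(token):
        for src, (dst, prob) in CONFUSIONS.items():
            if token.startswith(src, i) and rng.random() < prob:
                out.append(dst)
                i += len(src)
                break
        else:
            # No confusion fired at this position: copy the character.
            out.append(token[i])
            i += 1
    return "".join(out)


def compare_counts(tokens, seed=0):
    """Count tokens before and after the simulated OCR pass."""
    rng = random.Random(seed)
    true_counts = Counter(tokens)
    observed = Counter(simulate_ocr(t, rng) for t in tokens)
    return true_counts, observed


if __name__ == "__main__":
    corpus = ["modern"] * 1000 + ["modem"] * 50 + ["stay"] * 1000
    true_counts, observed = compare_counts(corpus)
    for word in ("modern", "modem", "stay"):
        print(word, true_counts[word], observed[word])
```

Under these made-up settings the count for "modern" should fall, the
count for "modem" should be inflated by misread instances of "modern",
and "stay" should be untouched because none of its letters appear in
the table. A retrieval task loses a few hits; a frequency model
inherits a directional bias.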

Dan Cristea wrote:
> Dear J.L. and Geoffrey,
>
> In a sense Geoffrey is certainly right; in another he is, fortunately, I
> would say, not. I want to mention a technology we are developing now
> which makes heavy use of scanned texts. In this project the intention is
> to link all citations given as examples of word senses in the Thesaurus
> Dictionary of the Romanian Language (more than 15,000 pages, about 175,000
> entries and more than 1,300,000 examples) to their sources (more than
> 3,300 volumes, summing up to something like 1,300,000 pages). The
> project intends to scan and OCR these resources and use the OCRed texts
> as a bridge through which citations in the dictionary can be
> located in the page images of the original editions. The user
> browsing for citations and contexts in the original editions will not be
> aware that an imperfect version, the output of OCR, is
> mediating the search. So, in a sense, scanned images can be
> retrieved as results of concordancers, if the search is mediated by OCR.
> This way, the search will not be exhaustive, of course, because some
> occurrences can be lost due to OCR errors, but it will still be very
> satisfactory if the corpus is big.
>
> Regards,
> Dan
>
>
> williams wrote:
>   
>> Dear J.L
>>
>> If they are not OCRed, I fail to see how you will use a concordancer on
>> them, and such tools are really the mainstay of corpus linguistics. In
>> some senses of the word 'corpus' they could be considered a 'corpus',
>> that is, a collection of texts, but in corpus linguistics a corpus needs
>> to be queryable with a concordancer.
>>
>> Best
>>
>> Geoffrey
>>
>>
>> On Thursday 31 July 2008 at 13:38 -0700, J.L. DeLucca wrote:
>>
>>> Dear friends,
>>>
>>> In the digital world there are digital libraries, like "Gallica,
>>> Bibliothèque nationale de France digital library", that work with
>>> scanned texts with no OCR treatment, and ebook projects that work
>>> with full texts. I want to know whether you would consider scanned
>>> texts with no OCR treatment as digital corpora, especially the oldest
>>> texts.
>>>
>>> Thank you for your advice.
>>>
>>> J.L. De Lucca
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>     
>>>       