[Corpora-List] SCANNED TEXTS ARE VALID FOR CORPORA PURPOSES?

Dan Cristea dcristea at info.uaic.ro
Fri Aug 1 17:20:22 UTC 2008


Dear J.L. and Geoffrey,

In one sense Geoffrey is certainly right; in another he is, fortunately I 
would say, not. I want to mention a technology we are developing now 
that makes heavy use of scanned texts. In this project the intention is 
to link all citations given as examples of word senses in the Thesaurus 
Dictionary of the Romanian language (more than 15,000 pages, about 175,000 
entries and more than 1,300,000 examples) to their sources (more than 
3,300 volumes, amounting to something like 1,300,000 pages). The 
project intends to scan and OCR these resources and to use the OCRed texts 
as a bridge through which citations in the dictionary can be 
located in the page images of the original editions. The user 
browsing for citations and contexts in the original editions will not be 
aware that an imperfect version, the output of OCR, is 
mediating the search. So, in a sense, scanned images can be 
retrieved as results of a concordancer, provided OCR mediates the lookup. 
The search will of course not be exhaustive, because some 
occurrences can be lost to OCR errors, but it should still be very 
satisfactory if the corpus is big.
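
To make the bridging idea concrete, here is a minimal sketch in Python. It 
is an illustration only, not the project's implementation: the function 
names, the page image names, the sample sentences and the 0.8 threshold 
are all assumptions, and the approximate matching uses the standard 
library's difflib rather than whatever the project actually uses. Each 
scanned page is represented by its imperfect OCR text; a citation is 
matched approximately against that text, and what is returned is the 
identifier of the page image.

```python
from difflib import SequenceMatcher


def best_window_score(citation, page_text):
    """Best similarity between the citation and any same-length word window of the page."""
    cit_words = citation.lower().split()
    page_words = page_text.lower().split()
    n = len(cit_words)
    if n == 0 or not page_words:
        return 0.0
    target = " ".join(cit_words)
    best = 0.0
    # Slide a window of n words across the OCR text of the page.
    for i in range(max(1, len(page_words) - n + 1)):
        window = " ".join(page_words[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def locate_citation(citation, ocr_pages, threshold=0.8):
    """Return (image_id, score) pairs for pages whose OCR text roughly contains the citation."""
    hits = []
    for image_id, text in ocr_pages.items():
        score = best_window_score(citation, text)
        if score >= threshold:  # tolerate a limited amount of OCR noise
            hits.append((image_id, score))
    # Best match first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical data: page image names mapped to their (imperfect) OCR text.
# Note the OCR error "rnan" for "man" on the first page.
ocr_pages = {
    "vol0012_p0045.png": "the old rnan walked slowly down the dusty road to the village",
    "vol0012_p0046.png": "nothing on this page resembles the quotation at all",
}

hits = locate_citation("the old man walked slowly down the dusty road", ocr_pages)
for image_id, score in hits:
    print(image_id, round(score, 2))
```

The user only ever sees the page image; the OCR text stays behind the 
scenes. Lowering the threshold recovers more citations damaged by OCR 
errors at the cost of more false hits, which is the trade-off behind the 
search not being exhaustive.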

Regards,
Dan


williams wrote:
> Dear J.L
>
> If they are not OCRed, I fail to see how you will use a concordancer on
> them, and such tools are really the mainstay of corpus linguistics. In
> some senses of the word 'corpus' they could be considered a 'corpus',
> that is a collection of texts, but in corpus linguistics a corpus needs
> to be queryable with a concordancer.
>
> Best
>
> Geoffrey
>
>
> On Thursday 31 July 2008 at 13:38 -0700, J.L. DeLucca wrote:
>   
>> Dear friends,
>>
>> In the digital world there are digital libraries such as Gallica, the
>> Bibliothèque nationale de France digital library, that work with
>> scanned texts without OCR treatment, and e-book projects that work
>> with full texts. I would like to know whether you would consider
>> scanned texts without OCR treatment to be digital corpora, especially
>> the oldest texts.
>>
>> Thank you for your advice.
>>
>> J.L. De Lucca
>>
>>
>>
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora




