[Corpora-List] SCANNED TEXTS ARE VALID FOR CORPORA PURPOSES?
Dan Cristea
dcristea at info.uaic.ro
Fri Aug 1 17:20:22 UTC 2008
Dear J.L. and Geoffrey,
In one sense Geoffrey is certainly right; in another he is, fortunately
I would say, not. I want to mention a technology we are now developing
that makes heavy use of scanned texts. The project aims to link every
citation given as an example for a word sense in the Thesaurus
Dictionary of the Romanian Language (more than 15,000 pages, about
175,000 entries and more than 1,300,000 examples) to its source (more
than 3,300 volumes, totalling something like 1,300,000 pages). We intend
to scan and OCR these sources and use the OCRed texts as a bridge
through which citations in the dictionary can be located in the page
images of the original editions. A user browsing for citations and
contexts in the original editions will not be aware that an imperfect
version, the OCR output, is mediating the search. So, in a sense,
scanned images can be retrieved as concordancer results, provided the
search is mediated by OCR. The search will of course not be exhaustive,
because some occurrences can be lost to OCR errors, but the results
should still be very satisfactory if the corpus is big.
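
As a rough illustration only (not the project's actual implementation),
the bridging idea above can be sketched in Python: a dictionary citation
is compared approximately against the OCR text of each page, and the
best-scoring page image is returned. The function names, the page
identifiers and the 0.8 threshold are all hypothetical, and the standard
library's difflib stands in for whatever matcher a real system would use.

```python
from difflib import SequenceMatcher


def best_window_score(citation, page_text):
    """Best similarity between the citation and any same-length word window."""
    words = page_text.split()
    n = len(citation.split())
    best = 0.0
    # Slide a window of n words over the OCR text of the page.
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        score = SequenceMatcher(None, citation.lower(), window.lower()).ratio()
        best = max(best, score)
    return best


def locate_citation(citation, ocr_pages, threshold=0.8):
    """Return the id of the page image whose OCR text best matches, or None.

    ocr_pages maps a page-image id (hypothetical naming) to its OCR text.
    The threshold is an assumed value; OCR errors lower the score, so a
    citation on a badly recognised page may be missed.
    """
    best_page, best_score = None, 0.0
    for page_id, text in ocr_pages.items():
        score = best_window_score(citation, text)
        if score > best_score:
            best_page, best_score = page_id, score
    return best_page if best_score >= threshold else None


# Toy data: the first page contains OCR errors ("brovvn", "ovcr").
pages = {
    "vol1_p012.png": "The quick brovvn fox jumps ovcr the lazy dog",
    "vol1_p013.png": "Completely unrelated text about something else",
}
found = locate_citation("quick brown fox jumps over", pages)
missed = locate_citation("zzzz qqqq xxxx", pages)
```

The user would be shown the page image named by the result, never the
OCR text itself; a citation that scores below the threshold is simply
not found, which is the non-exhaustiveness mentioned above.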
Regards,
Dan
williams wrote:
> Dear J.L
>
> If they are not OCRed, I fail to see how you will use a concordancer on
> them, and such tools are really the mainstay of corpus linguistics. In
> some senses of the word 'corpus' they could be considered a 'corpus',
> that is a collection of texts, but in corpus linguistics a corpus needs
> to be queryable with a concordancer.
>
> Best
>
> Geoffrey
>
>
> On Thursday, 31 July 2008 at 13:38 -0700, J.L. DeLucca wrote:
>
>> Dear friends,
>>
>> In the digital world there are digital libraries such as "Gallica,
>> Bibliothèque nationale de France digital library", which work with
>> scanned texts with no OCR treatment, and e-book projects, which work
>> with full texts. I would like to know whether you would consider
>> scanned texts with no OCR treatment to be digital corpora, especially
>> in the case of the oldest texts.
>>
>> Thank you for your advice.
>>
>> J.L. De Lucca
>>
>>
>>
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>