[Corpora-List] converting PDFs to ASCII or text-only files without clumps

maxwell maxwell at umiacs.umd.edu
Wed Jun 16 21:58:23 UTC 2010


On Wed, 16 Jun 2010 17:36:48 -0400, "Angus B. Grieve-Smith"
<grvsmth at panix.com> wrote:
> maxwell wrote:
>> I believe the OP said (s?)he was getting text out of them, but the word
>> boundaries were often missing.  If so, then their PDFs obviously have
>> text in them, they're not just images.  You can't get strings of text, 
>> with or without word boundaries, out of image PDFs.
>>   
>     You sure can if you're using optical character recognition 
> software.  And OCR output often has missing word boundaries, especially 
> if the software is not very good.
> 
>     I'd like to know if this is an OCR issue.  

Not for me; I have no OCR software.  This is just selecting text with a
mouse and copying it to the clipboard.  It happens in several PDF viewers.

   Mike Maxwell

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


