[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Angus B. Grieve-Smith grvsmth at panix.com
Wed Jun 16 21:36:48 UTC 2010


maxwell wrote:
> I believe the OP said (s?)he was getting text out of them, but the word
> boundaries were often missing.  If so, then their PDFs obviously have text
> in them, they're not just images.  You can't get strings of text, with or
> without word boundaries, out of image PDFs.
>   
    You sure can if you're using optical character recognition 
software.  And OCR output often has missing word boundaries, especially 
if the software is not very good.

    I'd like to know if this is an OCR issue.  In most of the PDFs I've 
seen that are not just collections of images, the text flows well and 
has word boundaries.

-- 
				-Angus B. Grieve-Smith
				grvsmth at panix.com

