[Corpora-List] converting PDFs to ASCII or text-only files without clumps
Angus B. Grieve-Smith
grvsmth at panix.com
Wed Jun 16 21:36:48 UTC 2010
maxwell wrote:
> I believe the OP said (s?)he was getting text out of them, but the word
> boundaries were often missing. If so, then their PDFs obviously have text
> in them, they're not just images. You can't get strings of text, with or
> without word boundaries, out of image PDFs.
>
You sure can if you're using optical character recognition
software. And OCR output often has missing word boundaries, especially
if the software is not very good.
I'd like to know if this is an OCR issue. In most of the PDFs I've seen
that are not just collections of images, the text flows well and has
word boundaries.
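
One rough way to check whether extracted text suffers from missing word boundaries is to measure how many tokens are implausibly long. The sketch below is only a heuristic for English-like text. The function name `clump_ratio` and the 20-character cutoff are assumptions made for illustration, not anything from the thread. It reports whether clumps are present, not whether OCR caused them.

```python
import re


def clump_ratio(text, max_len=20):
    """Return the fraction of alphabetic tokens longer than max_len characters.

    A high value suggests that word boundaries were lost during extraction.
    The default cutoff of 20 is an arbitrary assumption; tune it per corpus.
    """
    # Treat every run of ASCII letters as one token.
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    long_tokens = [t for t in tokens if len(t) > max_len]
    return len(long_tokens) / len(tokens)


if __name__ == "__main__":
    clean = "the text flows well and has word boundaries"
    clumped = "thetextflowswellandhaswordboundaries ok"
    print(clump_ratio(clean))
    print(clump_ratio(clumped))
```

Running this over the text extracted from each PDF and sorting by the ratio would show which files are affected, and those can then be inspected by hand to see whether they are scanned images with an OCR layer.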
--
-Angus B. Grieve-Smith
grvsmth at panix.com
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora