[Corpora-List] converting PDFs to ASCII or text-only files without clumps
maxwell
maxwell at umiacs.umd.edu
Wed Jun 16 21:58:23 UTC 2010
On Wed, 16 Jun 2010 17:36:48 -0400, "Angus B. Grieve-Smith"
<grvsmth at panix.com> wrote:
> maxwell wrote:
>> I believe the OP said (s?)he was getting text out of them, but the word
>> boundaries were often missing. If so, then their PDFs obviously have
>> text in them; they're not just images. You can't get strings of text,
>> with or without word boundaries, out of image PDFs.
>>
> You sure can if you're using optical character recognition
> software. And OCR output often has missing word boundaries, especially
> if the software is not very good.
>
> I'd like to know if this is an OCR issue.
Not for me; I have no OCR software. This is just selecting text with a
mouse and copying it to the clipboard. It happens in several PDF viewers.
Mike Maxwell
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora