[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Wed Jun 16 14:52:34 UTC 2010

On Wed, 16 Jun 2010, Roman Klinger <roman.klinger at scai.fraunhofer.de> wrote:

> On 06/16/2010 12:40 PM, John MCKENNY wrote:
> > ... the form of PDFs ... so far has been text-only files with many
> > words clumped together e.g. ‘inthefinalanalysisitseems’. Breaking up
> > these clumps is a time-consuming business.
>
> The problem in PDF is, that spaces are normally not stored, but the
> position of the glyphs on the page.

Although this is not seen in practice, the problem could be more complex
than that. Most PDF generators will spew a series of glyphs out in the same
sequence as in the original text, but they don't have to. A PDF page image
producer could just as well impose every odd-numbered word onto the page
and then impose every even-numbered word. The extraction tools in existence
rely on the simplistic assumption that imposition order follows textual
order ... but no producer is obliged to honour it.
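
To make the distinction concrete, here is a minimal sketch in Python. It is
a toy model, not the behaviour of any real PDF library or extraction tool:
the Glyph record, the layout() helper, the fixed glyph and space widths and
the gap_ratio threshold are all assumptions invented for illustration. It
contrasts reading glyphs in imposition order with sorting them by position
and guessing spaces from the horizontal gaps, on a single line of text.

```python
import random
from typing import NamedTuple


class Glyph(NamedTuple):
    """Hypothetical record for one painted glyph (not a real PDF API)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline of the glyph
    width: float  # advance width of the glyph


def layout(text, glyph_width=10.0, space_width=5.0):
    """Place text on one baseline. Spaces only advance x; no glyph is stored."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width
        else:
            glyphs.append(Glyph(ch, x, 0.0, glyph_width))
            x += glyph_width
    return glyphs


def extract_in_imposition_order(glyphs):
    """Naive extraction: trust the order in which glyphs were imposed."""
    return "".join(g.char for g in glyphs)


def extract_by_position(glyphs, gap_ratio=0.25):
    """Sort by position, then insert a space wherever the gap is wide enough."""
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))  # top to bottom, left to right
    out, prev = [], None
    for g in ordered:
        if prev is not None:
            if g.y != prev.y:
                out.append("\n")
            elif g.x - (prev.x + prev.width) > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


glyphs = layout("in the final analysis it seems")
shuffled = glyphs[:]
random.Random(0).shuffle(shuffled)  # a producer imposing glyphs in random order

clumped = extract_in_imposition_order(glyphs)     # "inthefinalanalysisitseems"
scrambled = extract_in_imposition_order(shuffled)  # same letters, wrong order
recovered = extract_by_position(shuffled)          # "in the final analysis it seems"
```

In this toy case the positional pass is indifferent to imposition order, but
only because the layout is a single line with uniform widths. Multiple
columns, kerning, justified text or rotated glyphs would need a good deal
more than one fixed threshold.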

In an effort to prevent the very extraction that John McKenny wants to
undertake, a producer could impose glyphs in a random sequence. It would be
possible, though probably computationally expensive, to impose
individual glyphs in a random order. The human reader would perceive the
final text as being correct and readable, but the original document could
not be recovered by any tool that assumes imposition order follows textual
order.

Regards, Trevor

<>< Re: deemed!
