[Corpora-List] converting PDFs to ASCII or text-only files without clumps

maxwell maxwell at umiacs.umd.edu
Wed Jun 16 19:06:01 UTC 2010


On Wed, 16 Jun 2010 10:13:41 -0600, Mark Davies <Mark_Davies at byu.edu>
wrote:
> Angus Grieve-Smith noted:
> 
>>>     It's not clear from the initial description: are these PDF files
>>>     with text in them, or with scanned images?  That makes a big
>>>     difference.
> 
> This is the crucial point, which few others seem to have addressed. 

I believe the OP said (s?)he was getting text out of them, but that the
word boundaries were often missing.  If so, then their PDFs obviously
contain text; they're not just images.  You can't get strings of text,
with or without word boundaries, out of image-only PDFs.

And even if the OP didn't say so, this (getting strings with no word
boundaries) has happened to me with PDFs that contain text.  And it is a
Nuisance.

As an aside, there are computational linguistic methods to insert word
boundaries in texts of languages where spaces are not written between words
(Chinese, Japanese, Thai).  Given the amount of training text available for
English, I would think it would be straightforward to create such a tool
for English.  And yes, by using the statistical properties of neighboring
words it would probably resolve ambiguities quite well (at least as well
as most humans would).  I don't know of anyone who has done this,
but a query in the right places would probably come up with ideas.  (There
has been some experimental work on inducing word boundaries in English
without training text, but that's a different--and harder--problem.)  It
might even be something that a beginning comp ling class could tackle.
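
As a rough illustration of the basic idea, something like the following
frequency-based segmenter would do.  It is only a sketch: it scores
candidate words with unigram frequencies alone rather than the
neighboring-word statistics mentioned above, and the count file name, its
format, and the cap on candidate word length are all invented for the
example.

import math
from functools import lru_cache

# Load unigram counts from a tab-separated "word<TAB>count" file.
# (File name and format are placeholders for whatever counts you have.)
def load_unigrams(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            counts[word] = int(count)
    return counts

COUNTS = load_unigrams("unigram_counts.txt")
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20  # arbitrary cap on candidate word length

def log_prob(word):
    # Unseen strings get a probability that shrinks with their length,
    # so the segmenter prefers known words over long unknown chunks.
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable segmentation of `text` as a tuple of words."""
    if not text:
        return ()
    splits = ((text[:i], text[i:])
              for i in range(1, min(len(text), MAX_WORD_LEN) + 1))
    candidates = ((first,) + segment(rest) for first, rest in splits)
    return max(candidates, key=lambda words: sum(log_prob(w) for w in words))

print(" ".join(segment("itisanuisance")))  # "it is a nuisance", given good counts

Replacing the unigram scores with bigram or higher-order statistics is
where the neighboring-word information would come in, and that is exactly
the part a class project could experiment with.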

   Mike Maxwell
