[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Roman Klinger roman.klinger at scai.fhg.de
Wed Jun 16 10:55:04 UTC 2010


Hi,

On 06/16/2010 12:40 PM, John MCKENNY wrote:
> Does anyone have a solution  to the problem we are facing in a corpus
> linguistic research project? We have been given permission by the
> publishers and editors to download all issues of a journal from the last
> 30 years obtainable from our university e-library in the form of PDFs
> amounting to about 3,000,000 words. Starting with a small sample
> (250,000 words), we tried using various methods and software including
> Wordsmith Tools 5 to convert the PDFs into text-only files. The result
> so far has been text-only files with many words clumped together e.g.
> ‘inthefinalanalysisitseems’. Breaking up these clumps is a
> time-consuming business.

The problem with PDF is that spaces are normally not stored. What is 
stored is the position of each glyph on the page.

Therefore, the spaces need to be guessed.

I recommend using a tool that exposes the glyph positions on the page, 
so that you can adjust the parameter that decides which distance 
between glyphs is counted as a space.
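
To make the idea concrete, here is a minimal sketch in Python of how such 
a threshold works. It is not taken from PDFBox or any other tool: the 
Glyph class, join_glyphs, the gap_ratio parameter and the layout helper 
are all hypothetical names, and the glyph positions are made up for 
illustration. A real extractor would take them from the PDF.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str     # the character drawn
    x: float      # left edge on the page, in points
    width: float  # advance width, in points


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text, guessing spaces from horizontal gaps.

    A space is inserted when the gap between two neighbouring glyphs
    exceeds gap_ratio times the mean glyph width (an assumed heuristic).
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    mean_width = sum(g.width for g in ordered) / len(ordered)
    threshold = gap_ratio * mean_width
    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def layout(text: str, char_width: float = 5.0, space_width: float = 2.0) -> List[Glyph]:
    """Hypothetical helper: fake glyph positions for a string.

    Spaces are not emitted as glyphs; they only move the next glyph right.
    """
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width
        else:
            glyphs.append(Glyph(ch, x, char_width))
            x += char_width
    return glyphs


line = layout("in the final analysis")
print(join_glyphs(line, gap_ratio=0.3))  # threshold 1.5 pt, below the 2 pt gaps
print(join_glyphs(line, gap_ratio=0.5))  # threshold 2.5 pt, above the 2 pt gaps
```

With the lower ratio the words come out separated. With the higher one 
the same glyphs come out as a single clump, which is the effect described 
in the question. Tuning this value per journal layout is the kind of 
control I mean.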

Candidates are PDFBox and pdftoxml. In addition, I think any open-source 
text-extraction library would do, since this parameter has to be 
somewhere in the extraction process.

(I am using PDFBox and can recommend that.)

Best,
  Roman

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


