[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Wed Jun 16 11:41:13 UTC 2010

Hi,

On 06/16/2010 12:40 PM, John MCKENNY wrote:
> Does anyone have a solution to the problem we are facing in a corpus
> linguistic research project? We have been given permission by the
> publishers and editors to download all issues of a journal from the last
> 30 years obtainable from our university e-library in the form of PDFs
> amounting to about 3,000,000 words. Starting with a small sample
> (250,000 words), we tried using various methods and software including
> Wordsmith Tools 5 to convert the PDFs into text-only files. The result
> so far has been text-only files with many words clumped together e.g.
> ‘inthefinalanalysisitseems’. Breaking up these clumps is a
> time-consuming business.

The problem with PDF is that spaces are normally not stored; only the
positions of the glyphs on the page are.

Therefore, the spaces need to be guessed.

I recommend using a tool that exposes the glyph positions on the page,
so that you can adjust the parameter that decides which distance between
glyphs should be counted as a space.
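
To illustrate what such a parameter does, here is a minimal sketch in
Python. It is not the code of PDFBox or pdftoxml; the Glyph record, the
join_glyphs function and the space_ratio parameter are names made up for
this example, and it assumes the glyphs of a single line are already
available with their horizontal position and width. A space is guessed
whenever the gap between two neighbouring glyphs exceeds a fraction of
the average glyph width.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One glyph on a text line (hypothetical record for this sketch)."""
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # horizontal extent of the glyph


def join_glyphs(glyphs, space_ratio=0.3):
    """Rebuild a line of text, guessing spaces from gaps between glyphs.

    space_ratio is the tunable parameter: a gap wider than
    space_ratio * (average glyph width) is counted as a space.
    """
    if not glyphs:
        return ""
    # Work left to right, whatever order the glyphs were stored in.
    ordered = sorted(glyphs, key=lambda g: g.x)
    avg_width = sum(g.width for g in ordered) / len(ordered)
    pieces = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > space_ratio * avg_width:
            pieces.append(" ")
        pieces.append(cur.char)
    return "".join(pieces)
```

If the threshold is set too high, word gaps are no longer recognised and
the words clump together, as in the output described in the question; if
it is set too low, spaces appear inside words. The right value usually
depends on the fonts and typesetting of the journal in question.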

Options include PDFBox, pdftoxml and, I think, any open-source text
extraction library, since this parameter has to exist somewhere in the
extraction process.

(I am using PDFBox and can recommend that.)

Best,
   Roman

-- 
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Department of Bioinformatics
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fhg.de
http://www.scai.fraunhofer.de/klinger.html

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora