[Corpora-List] converting PDFs to ASCII or text-only files without clumps
Roman Klinger
roman.klinger at scai.fhg.de
Wed Jun 16 10:55:04 UTC 2010
Hi,
On 06/16/2010 12:40 PM, John MCKENNY wrote:
> Does anyone have a solution to the problem we are facing in a corpus
> linguistic research project? We have been given permission by the
> publishers and editors to download all issues of a journal from the last
> 30 years obtainable from our university e-library in the form of PDFs
> amounting to about 3,000,000 words. Starting with a small sample
> (250,000 words), we tried using various methods and software including
> Wordsmith Tools 5 to convert the PDFs into text-only files. The result
> so far has been text-only files with many words clumped together e.g.
> ‘inthefinalanalysisitseems’. Breaking up these clumps is a
> time-consuming business.
The problem is that a PDF normally does not store spaces at all, only
the position of each glyph on the page.
The spaces therefore have to be guessed from the gaps between glyphs.
I recommend using a tool that exposes this position information, so
that you can adjust the parameter that decides how large a distance
between two glyphs has to be before it is counted as a space.
Tools that let you do this include PDFBox and pdftoxml and, I think,
any open-source text-extraction library, since this parameter has to
exist somewhere in the extraction process.
(I use PDFBox myself and can recommend it.)
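To illustrate the guessing step, here is a minimal sketch in Python. It
is not PDFBox code: the glyph records (character, left x position,
width), the function names and the space_ratio parameter are all
assumptions made up for the example. It only shows the idea that a
space is emitted whenever the gap between two neighbouring glyphs
exceeds a tunable fraction of the average glyph width.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, left x position, width).
Glyph = Tuple[str, float, float]


def rebuild_line(glyphs: List[Glyph], space_ratio: float = 0.3) -> str:
    """Rebuild one text line from glyph positions.

    A space is inserted wherever the horizontal gap between two
    neighbouring glyphs is larger than space_ratio times the average
    glyph width of the line.
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g[1])  # left to right
    avg_width = sum(g[2] for g in ordered) / len(ordered)
    threshold = space_ratio * avg_width
    out = [ordered[0][0]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - (prev[1] + prev[2])  # right edge of prev to left edge of cur
        if gap > threshold:
            out.append(" ")
        out.append(cur[0])
    return "".join(out)


def toy_glyphs(words: List[str], width: float = 5.0,
               letter_gap: float = 0.5, word_gap: float = 2.5) -> List[Glyph]:
    """Lay out words as fixed-width glyphs (toy data, not read from a PDF)."""
    glyphs: List[Glyph] = []
    x = 0.0
    for index, word in enumerate(words):
        if index > 0:
            x += word_gap - letter_gap  # widen the last gap to a word gap
        for ch in word:
            glyphs.append((ch, x, width))
            x += width + letter_gap
    return glyphs


if __name__ == "__main__":
    line = toy_glyphs(["in", "the", "final"])
    print(rebuild_line(line, space_ratio=0.3))  # in the final
    print(rebuild_line(line, space_ratio=0.6))  # inthefinal
```

With the threshold set too high, the word gaps fall below it and the
output clumps together exactly as in your example; lowering it brings
the spaces back. A threshold that is too low would instead split words
apart, so the value has to be tuned per font and layout.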
Best,
Roman
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora