[Corpora-List] converting PDFs to ASCII or text-only files without clumps

John K Pate j.k.pate at sms.ed.ac.uk
Wed Jun 16 11:12:56 UTC 2010


On Wed, 16 Jun 2010, John MCKENNY wrote:

> 
> Does anyone have a solution to the problem we are facing in a corpus linguistic research project? We have been given permission by the publishers and editors to download all issues of a journal from the last 30 years obtainable from our university e-library in the form of
> PDFs amounting to about 3,000,000 words. Starting with a small sample (250,000 words), we tried using various methods and software including Wordsmith Tools 5 to convert the PDFs into text-only files. The result so far has been text-only files with many words clumped
> together, e.g. ‘inthefinalanalysisitseems’. Breaking up these clumps is a time-consuming business. For this reason, we haven’t started compiling our larger corpus. We would only build the larger corpus if there was some kind of automated or semi-automated way to generate
> text-only files which contained all and only the alphanumeric sequences bounded by spaces in the original PDFs, in other words, without clumps.
> 
> We would be very grateful for any suggestions you might have.
> 
> Best wishes
> 
> John McKenny
> Deputy Head of the Division of English Studies
> University of Nottingham Ningbo, China
> 199 Taikang Dong Lu
> Ningbo, Zhejiang Province
> P.R.China   315100
> 
> john.mckenny at nottingham.edu.cn

Have you tried pdftotext? It's part of the xpdf project and has worked
well for me when I want to read a PDF without launching a windowed viewer.

http://www.foolabs.com/xpdf/download.html
http://www.foolabs.com/xpdf/home.html

You could exclude non-alphanumerics by piping the output to sed or
similar.
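
For a whole batch of files, the same idea can be scripted. Below is a
minimal sketch in Python rather than sed, assuming pdftotext is installed
and on the PATH. The function names pdf_to_text and keep_alphanumeric are
made up for illustration, and the filter assumes plain ASCII letters and
digits are all you want to keep.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on one PDF and return the extracted text.

    Assumes the pdftotext binary is on the PATH. Passing "-" as the
    output file makes pdftotext write to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def keep_alphanumeric(text):
    """Replace each run of non-alphanumeric characters with one space.

    Assumption: only ASCII letters and digits count as alphanumeric,
    so accented characters and apostrophes are treated as separators.
    Line breaks are collapsed into spaces as well.
    """
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python pdf2words.py article.pdf
    for path in sys.argv[1:]:
        print(keep_alphanumeric(pdf_to_text(path)))
```

Note that this filter only strips punctuation. It cannot split words that
are already clumped in the extracted text, so it is worth checking the raw
pdftotext output on the small sample first. It also splits contractions
(haven't becomes "haven t"), so adjust the pattern if apostrophes matter
for your counts.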

John

==

John K Pate
Student, PhD Informatics
Informatics Forum 3.35
The University of Edinburgh
http://homepages.inf.ed.ac.uk/s0930006/
-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

