[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Andrew Gilbert andy at agilbert.net
Fri Oct 12 11:30:07 UTC 2012


poppler is an open-source package with some nice tools for this.

pdftotext will convert a PDF to plain text.
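
If there is a whole directory of PDFs, a small wrapper can call pdftotext on each one. This is only a rough sketch, assuming poppler's pdftotext is installed and on the PATH. The function names and the directory arguments are made up for illustration. The -layout flag is optional; it asks pdftotext to keep the physical layout as far as it can.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext argument list for one file (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_directory(pdf_dir, out_dir, keep_layout=True):
    """Run pdftotext on every *.pdf in pdf_dir, writing .txt files to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext fails on a file
        subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
        written.append(txt_path)
    return written
```

Each PDF ends up as a single text file, so there is no per-page output to stitch back together.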

But perhaps more helpful for retaining some of the column and layout information, you can also use pdftohtml to convert to XML format with positional data, for example:

pdftohtml -xml input.pdf output.xml 

<text top="79" left="652" width="171" height="12" font="2"><b>LOCATION: NIKKEN BUILDING</b></text>
<text top="92" left="121" width="129" height="12" font="4">Woodland Hills, CA 91367</text>
<text top="91" left="652" width="140" height="12" font="2"><b>                    52 Discovery</b></text>
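
The positional attributes make it possible to put the fragments back into reading order. Here is a minimal sketch, assuming the XML has <page> elements containing <text> elements with integer top and left attributes as in the sample above. The function name and the 3-unit tolerance are my own choices, not anything poppler provides. It groups fragments whose top values are close together into one line, then orders each line left to right.

```python
import xml.etree.ElementTree as ET


def text_lines(xml_string, tolerance=3):
    """Group <text> fragments into lines by 'top', ordered by 'left'."""
    root = ET.fromstring(xml_string)
    result = []
    for page in root.iter("page"):
        items = []
        for el in page.iter("text"):
            # itertext() also picks up text inside nested <b>/<i> tags
            content = "".join(el.itertext()).strip()
            if content:
                items.append((int(el.get("top")), int(el.get("left")), content))
        items.sort()
        lines = []
        for top, left, content in items:
            # same line if 'top' is within tolerance of the line's first fragment
            if lines and top - lines[-1]["top"] <= tolerance:
                lines[-1]["cells"].append((left, content))
            else:
                lines.append({"top": top, "cells": [(left, content)]})
        for line in lines:
            result.append([content for _, content in sorted(line["cells"])])
    return result
```

On the three sample fragments above (wrapped in a single page element), this gives two lines: ['LOCATION: NIKKEN BUILDING'] and ['Woodland Hills, CA 91367', '52 Discovery']. The left values could then be used to decide which column each fragment belongs to.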


Andrew Gilbert
andy at agilbert.net
(m) 802-535-1653
(h) 802-426-2108





On Oct 12, 2012, at 6:28 AM, "Krishnamurthy, Ramesh" <r.krishnamurthy at aston.ac.uk> wrote:

> Hi Mark
>
> Several people have asked recently about the easiest way to convert PDF files to plain text
> (including Rama Meganathan on this list). I know there are various problems:
>
> a) graphic PDFs rather than text PDFs, e.g. when people have scanned older texts
> that were not created/available as digitized text?
> b) columnar layout
> c) embedded graphics, e.g. photos, diagrams, graphs
> d) software that can only process one page at a time, or outputs one file per page
> e) minor irritations, such as page numbers and headers/footers that need to be edited out
>
> What is currently the easiest method/software to convert PDF files to plain text files?
>
> best
>
> Ramesh
> 
> -------------------------
> 
> Date: Thu, 11 Oct 2012 15:37:54 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: Re: [Corpora-List] corpus of textbooks
> To: MAT T <terrettgnome at hotmail.com>, "corpora at uib.no"
> <corpora at uib.no>
> 
> Lots of free textbooks (legally!) at: http://www.ck12.org/ . Just download the PDF's and convert to text.
> 
> Mark Davies
> 
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> 
> 
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

