<p>Hi All <br>

An OCR tool can help in the case of scanned PDFs, e.g. ABBYY FineReader. <br>

Regards </p>

<div class="gmail_quote">On Oct 12, 2012 4:39 PM, "Andrew Gilbert" &lt;<a href="mailto:andy@agilbert.net">andy@agilbert.net</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Poppler is an open-source package with some nice tools for this<br>

<br>

pdftotext will convert to plain text<br>
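<br>
For a whole folder of PDFs, the call can be scripted. The following is only a minimal sketch: it assumes pdftotext from Poppler is installed and on the PATH, and the function names (build_pdftotext_command, convert_folder) are made up for illustration. The -layout flag asks pdftotext to keep the physical page layout where it can; drop it if reading order matters more than columns.<br>

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Return the argument list for a single pdftotext call."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout: keep the original physical layout as far as possible
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def convert_folder(folder, keep_layout=True):
    """Convert every *.pdf in `folder` to a .txt file next to it.

    Hypothetical helper: one output file per PDF, not one per page.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        command = build_pdftotext_command(pdf_path, txt_path, keep_layout)
        subprocess.run(command, check=True)  # raises if pdftotext fails
        written.append(txt_path)
    return written
```

Calling convert_folder("textbooks") would then write one .txt file per PDF in that (hypothetical) folder.<br>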

<br>

To retain some of the column and layout information, it may be more helpful to use pdftohtml, which can convert to XML with positional data. For example:<br>

<br>

pdftohtml -xml input.pdf output.xml<br>

<br>

&lt;text top="79" left="652" width="171" height="12" font="2"&gt;&lt;b&gt;LOCATION: NIKKEN BUILDING&lt;/b&gt;&lt;/text&gt;<br>

&lt;text top="92" left="121" width="129" height="12" font="4"&gt;Woodland Hills, CA 91367&lt;/text&gt;<br>

&lt;text top="91" left="652" width="140" height="12" font="2"&gt;&lt;b&gt;                    52 Discovery&lt;/b&gt;&lt;/text&gt;<br>
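<br>
One way to use the positional data is to group text fragments whose top values are close together into the same visual row, then order each row by left. The sketch below does that with Python's standard xml.etree.ElementTree. The function name rows_from_pdftohtml_xml and the 3-unit tolerance are assumptions for illustration, and it assumes top and left are integers, as in the sample above.<br>

```python
import xml.etree.ElementTree as ET


def rows_from_pdftohtml_xml(xml_string, tolerance=3):
    """Group text fragments into visual rows using their top/left attributes.

    Returns a list of rows; each row is a list of strings ordered left to right.
    """
    root = ET.fromstring(xml_string)
    # Positions restart on every page, so handle pages one at a time.
    # Fall back to the root element if there are no page elements.
    pages = list(root.iter("page")) or [root]
    all_rows = []
    for page in pages:
        fragments = []
        for element in page.iter("text"):
            # itertext() also collects text inside nested tags such as bold
            content = "".join(element.itertext()).strip()
            if content:
                top = int(element.get("top"))
                left = int(element.get("left"))
                fragments.append((top, left, content))
        fragments.sort()  # by top, then by left

        page_rows = []  # each entry: [top of first fragment, [(left, text), ...]]
        for top, left, content in fragments:
            if page_rows and tolerance >= top - page_rows[-1][0]:
                page_rows[-1][1].append((left, content))
            else:
                page_rows.append([top, [(left, content)]])

        for _, cells in page_rows:
            all_rows.append([text for _, text in sorted(cells)])
    return all_rows
```

Applied to the three sample lines above (wrapped in a single root element), this puts "Woodland Hills, CA 91367" and "52 Discovery" in the same row, because their top values differ by only 1. Splitting rows into columns by left position would be a natural next step.<br>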

<br>

<br>

Andrew Gilbert<br>

<a href="mailto:andy@agilbert.net">andy@agilbert.net</a><br>

(m) 802-535-1653<br>

(h) 802-426-2108<br>

<br>


On Oct 12, 2012, at 6:28 AM, "Krishnamurthy, Ramesh" &lt;<a href="mailto:r.krishnamurthy@aston.ac.uk">r.krishnamurthy@aston.ac.uk</a>&gt; wrote:<br>

<br>

><br>


> Hi Mark<br>

><br>

> Several people have asked recently about the easiest way to convert PDF files to plain text<br>

><br>

> (including Rama Meganathan on this list). I know there are various problems:<br>

><br>

> a) graphic PDFs rather than text PDFs, e.g. when people have scanned older texts that were not created/available as digitized text<br>

><br>

> b) columnar layout<br>

><br>

> c) embedded graphics, e.g. photos, diagrams, graphs<br>

><br>

> d) software that can only process one page at a time, or outputs one file per page<br>

><br>

> e) minor irritations, such as page numbers and headers/footers that need to be edited out<br>

><br>

><br>

><br>

> What is currently the easiest method/software to convert PDF files to plain text files?<br>

><br>

><br>

><br>

> best<br>

><br>

> Ramesh<br>

><br>

> -------------------------<br>

><br>

> Date: Thu, 11 Oct 2012 15:37:54 +0000<br>

> From: Mark Davies &lt;<a href="mailto:Mark_Davies@byu.edu">Mark_Davies@byu.edu</a>&gt;<br>

> Subject: Re: [Corpora-List] corpus of textbooks<br>

> To: MAT T &lt;<a href="mailto:terrettgnome@hotmail.com">terrettgnome@hotmail.com</a>&gt;, "<a href="mailto:corpora@uib.no">corpora@uib.no</a>" &lt;<a href="mailto:corpora@uib.no">corpora@uib.no</a>&gt;<br>

><br>

> Lots of free textbooks (legally!) at: <a href="http://www.ck12.org/" target="_blank">http://www.ck12.org/</a> . Just download the PDFs and convert to text.<br>

><br>

> Mark Davies<br>

><br>

> ============================================<br>

> Mark Davies<br>

> Professor of Linguistics / Brigham Young University<br>

> <a href="http://davies-linguistics.byu.edu/" target="_blank">http://davies-linguistics.byu.edu/</a><br>

> ** Corpus design and use // Linguistic databases **<br>

> ** Historical linguistics // Language variation **<br>

> ** English, Spanish, and Portuguese **<br>

> ============================================<br>

><br>

><br>

> _______________________________________________<br>

> UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

> Corpora mailing list<br>

> <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>


</blockquote></div>