[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

True Friend true.friend2004 at gmail.com
Fri Oct 12 12:02:22 UTC 2012


Hi All
OCR software can help in the case of scanned PDFs, e.g. ABBYY FineReader.
Regards
On Oct 12, 2012 4:39 PM, "Andrew Gilbert" <andy at agilbert.net> wrote:

> poppler is an OSS package with some nice weapons for this
>
> pdftotext will convert to plain text
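
For anyone scripting this over a directory of textbooks, here is a minimal sketch of calling pdftotext from Python. It assumes poppler's command-line tools are installed and on the PATH. The helper names (pdftotext_command, pdf_to_text) are made up for illustration; -layout is the pdftotext option that tries to keep the physical page layout.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for poppler's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the original physical layout.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, keep_layout=False):
    """Convert one PDF to a single .txt file next to it and return that path."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path
```

Each call writes the whole document to one text file, so there is no per-page output to stitch together afterwards.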
>
> But perhaps more helpful for retaining some of the column and layout
> information, you can also use pdftohtml to convert to XML format with
> positional data, for example:
>
> pdftohtml -xml input.pdf output.xml
>
> <text top="79" left="652" width="171" height="12" font="2"><b>LOCATION: NIKKEN BUILDING</b></text>
> <text top="92" left="121" width="129" height="12" font="4">Woodland Hills, CA 91367</text>
> <text top="91" left="652" width="140" height="12" font="2"><b>52 Discovery</b></text>
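
A rough sketch of how the positional attributes in that output could be used to separate two columns. The three text elements are the ones quoted above; the enclosing page element, the function name columns_from_page and the split_at=400 threshold are assumptions for illustration, and a real page would need its own threshold.

```python
import xml.etree.ElementTree as ET

# The <text> elements come from the message above; the <page> wrapper is assumed.
SAMPLE = """<page number="1">
<text top="79" left="652" width="171" height="12" font="2"><b>LOCATION: NIKKEN BUILDING</b></text>
<text top="92" left="121" width="129" height="12" font="4">Woodland Hills, CA 91367</text>
<text top="91" left="652" width="140" height="12" font="2"><b>52 Discovery</b></text>
</page>"""


def columns_from_page(page_xml, split_at):
    """Split a page's <text> elements into left/right columns by their `left` value."""
    page = ET.fromstring(page_xml)
    left_col, right_col = [], []
    for el in page.iter("text"):
        # itertext() also picks up text inside nested tags such as <b>.
        content = " ".join("".join(el.itertext()).split())
        item = (int(el.get("top")), content)
        if int(el.get("left")) < split_at:
            left_col.append(item)
        else:
            right_col.append(item)
    # Within each column, reading order is top to bottom.
    return [t for _, t in sorted(left_col)], [t for _, t in sorted(right_col)]


left, right = columns_from_page(SAMPLE, split_at=400)
print(left)
print(right)
```

Sorting by top within each column keeps the two right-hand lines together instead of interleaving them with the left-hand address line.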
>
>
> Andrew Gilbert
> andy at agilbert.net
> (m) 802-535-1653
> (h) 802-426-2108
>
>
>
>
>
> On Oct 12, 2012, at 6:28 AM, "Krishnamurthy, Ramesh" <
> r.krishnamurthy at aston.ac.uk> wrote:
>
> >
> >
> >
> >
> > Hi Mark
> >
> > Several people have asked recently about the easiest way to convert PDF
> > files to plain text (including Rama Meganathan on this list). I know
> > there are various problems:
> >
> > a) graphic PDFs rather than text PDFs - eg when people have scanned
> > older texts that were not created/available as digitized text?
> >
> > b) columnar layout
> >
> > c) embedded graphics, eg photos, diagrams, graphs
> >
> > d) software that can only process one page at a time, or outputs one
> > file per page
> >
> > e) minor irritations, such as page numbers and headers/footers that need
> > to be edited out
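
For point (e), one possible heuristic, sketched under assumptions: the text has one form feed character between pages (which is how pdftotext separates pages by default), page numbers sit alone on a line, and running headers/footers are the first or last non-blank line on most pages. The function name strip_running_lines and the min_share threshold are hypothetical.

```python
from collections import Counter


def strip_running_lines(text, min_share=0.5):
    """Drop bare page numbers and lines that recur at the top/bottom of pages."""
    pages = [p.splitlines() for p in text.split("\f") if p.strip()]
    edge_counts = Counter()
    for lines in pages:
        nonblank = [l.strip() for l in lines if l.strip()]
        if nonblank:
            # Count the first and last non-blank line of each page once.
            edge_counts.update({nonblank[0], nonblank[-1]})
    # A line is treated as a running header/footer if it sits at a page edge
    # on more than min_share of the pages.
    repeated = {
        line for line, n in edge_counts.items()
        if len(pages) > 1 and n / len(pages) > min_share
    }
    cleaned = []
    for lines in pages:
        # Note: a repeated line is dropped wherever it occurs on the page.
        kept = [
            l for l in lines
            if l.strip() not in repeated and not l.strip().isdigit()
        ]
        cleaned.append("\n".join(kept).strip())
    return "\n".join(cleaned)
```

This is deliberately crude: it will also delete a body line that happens to be only digits, so the output should be spot-checked before the text goes into a corpus.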
> >
> >
> >
> > What is currently the easiest method/software to convert PDF files to
> > plain text files?
> >
> >
> >
> > best
> >
> > Ramesh
> >
> > -------------------------
> >
> > Date: Thu, 11 Oct 2012 15:37:54 +0000
> > From: Mark Davies <Mark_Davies at byu.edu>
> > Subject: Re: [Corpora-List] corpus of textbooks
> > To: MAT T <terrettgnome at hotmail.com>, "corpora at uib.no"
> > <corpora at uib.no>
> >
> > Lots of free textbooks (legally!) at: http://www.ck12.org/ . Just
> > download the PDF's and convert to text.
> >
> > Mark Davies
> >
> > ============================================
> > Mark Davies
> > Professor of Linguistics / Brigham Young University
> > http://davies-linguistics.byu.edu/
> > ** Corpus design and use // Linguistic databases **
> > ** Historical linguistics // Language variation **
> > ** English, Spanish, and Portuguese **
> > ============================================
> >
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>
>
>

