[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Fri Oct 12 10:28:31 UTC 2012


Hi Mark

Several people have asked recently about the easiest way to convert PDF files to plain text
(including Rama Meganathan on this list). I know there are various problems:

a) graphic PDFs rather than text PDFs, e.g. when people have scanned older texts
that were not created/available as digitized text

b) columnar layout

c) embedded graphics, e.g. photos, diagrams, graphs

d) software that can only process one page at a time, or outputs one file per page

e) minor irritations, such as page numbers and headers/footers that need to be edited out
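
To make (d) and (e) concrete, here is a minimal Python sketch, not a recommendation of any particular converter. It assumes the PDF has already been converted to one text string in which pages are separated by form-feed characters (some converters, such as pdftotext, can produce output like this, but check your own tool). The function name clean_pages and the min_repeat threshold are made up for illustration. It joins the pages into one text, drops lines that contain only a page number, and drops lines that recur as the first or last line on most pages, which are likely running headers/footers. It will also drop any body line that is just a number, so check the output by hand.

```python
import re
from collections import Counter

# A line holding only a number, optionally preceded by "page" (assumed format).
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def clean_pages(raw_text, min_repeat=0.6):
    """Join form-feed-separated pages into one text, dropping page-number
    lines and lines that recur at the top or bottom of most pages."""
    # Split into pages and remove page-number lines first.
    pages = []
    for page in raw_text.split("\f"):
        if not page.strip():
            continue
        pages.append([ln for ln in page.splitlines() if not PAGE_NUMBER.match(ln)])

    # Count on how many pages each line is the first or last non-blank line.
    edge_counts = Counter()
    for lines in pages:
        non_blank = [ln.strip() for ln in lines if ln.strip()]
        if non_blank:
            edge_counts.update({non_blank[0], non_blank[-1]})

    # Treat a line as a header/footer if it is an edge line on at least
    # min_repeat of the pages (and on at least two pages).
    threshold = max(2, min_repeat * len(pages))
    repeated = {line for line, n in edge_counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = [ln for ln in lines if ln.strip() not in repeated]
        cleaned.append("\n".join(kept).strip())
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = (
        "My Textbook\nChapter one text.\n1\n\f"
        "My Textbook\nMore text here.\n2\n\f"
        "My Textbook\nFinal text.\n3\n"
    )
    print(clean_pages(sample))
```

This does nothing for (a), (b) or (c): scanned pages still need OCR, and columns and embedded graphics depend on what the converter itself can do.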



What is currently the easiest method/software to convert PDF files to plain text files?



best

Ramesh

-------------------------

Date: Thu, 11 Oct 2012 15:37:54 +0000
From: Mark Davies <Mark_Davies at byu.edu>
Subject: Re: [Corpora-List] corpus of textbooks
To: MAT T <terrettgnome at hotmail.com>, "corpora at uib.no"
<corpora at uib.no>

Lots of free textbooks (legally!) at: http://www.ck12.org/ . Just download the PDF's and convert to text.

Mark Davies

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


