[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Mon Oct 15 13:14:44 UTC 2012

On 10/12/2012 6:28 AM, Krishnamurthy, Ramesh wrote:
> I know there are various problems:
>
> a) graphic PDFs rather than text PDFs - eg when people have scanned older texts
> that were not created/available as digitized text?
>
> b) columnar layout
>
> c) embedded graphics, eg photos, diagrams, graphs
>
> d) software that can only process one page at a time, or outputs one file per page
>
> e) minor irritations, such as page numbers and headers/footers that need to be edited out

On 10/12/2012 6:53 AM, Martin Reynaert wrote:
> The e) issue you raise is not minor when building a large corpus...

These issues can become nightmares in some cases.  PostScript and PDF
allow blocks of text and graphics to be placed on a page at any
location and in any order.  Anyone who tries to analyze the PDF source
to extract a linear sequence of text may encounter serious obstacles:

  1. In generating multi-column text, some formatters generate the page
     one line at a time, starting from the top.  The linear sequence
     in the PDF file will contain all the columns interleaved.

  2. To justify text, some formatters do not insert spaces of varying
     width into the text.  Instead, they just calculate where each word
     should go and place it there directly.  As a result, the string
     of text does not contain any blanks between words.

  3. For very large fonts in titles and headings, some formatters
     generate two lines of special characters -- one for the tops
     of the letters and one for the bottoms.

  4. Because of obstacles #1, #2, and #3 (and others), some PDF-to-text
     analyzers generate an intermediate file in print format and use
     OCR to translate it to text.  But no OCR tool is perfect.

  5. Among the problems with OCR are changes in fonts, changes from
     roman to italic to bold to bold italic, etc.  Letters with umlauts
     and accents create problems, especially with characters used in
     less common languages.  Superscripts and subscripts are frequently
     mangled.  Mathematical formulas are almost always mangled.
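
To make obstacles #1 and #2 concrete, here is a minimal sketch of how
a tool might rebuild reading order from positioned words.  It is an
illustration only, not what any particular converter does.  The input
format is an assumption: one (x_left, x_right, y_top, text) tuple per
word, with y measured from the top of the page and identical for all
words on the same line.  The names rebuild_text, column_split_x, and
space_gap are hypothetical, and the sketch assumes exactly two columns
whose dividing x position is already known.

```python
from collections import defaultdict


def rebuild_text(fragments, column_split_x, space_gap=1.0):
    """Return the page text in reading order: left column, then right.

    fragments: iterable of (x_left, x_right, y_top, text) tuples in
    whatever order the file stored them (assumed input format).
    """
    # Sort each fragment into a column, then into a line within that column.
    columns = {0: defaultdict(list), 1: defaultdict(list)}
    for x0, x1, y, text in fragments:
        col = 0 if x0 < column_split_x else 1
        columns[col][y].append((x0, x1, text))

    lines = []
    for col in (0, 1):
        for y in sorted(columns[col]):  # top of the page first
            words = sorted(columns[col][y])  # left to right
            line = words[0][2]
            for (_, prev_x1, _), (x0, _, text) in zip(words, words[1:]):
                # The file holds no blanks, so a horizontal gap of at
                # least space_gap is taken to be a word break.
                line += (" " if x0 - prev_x1 >= space_gap else "") + text
            lines.append(line)
    return "\n".join(lines)


# Two columns stored line by line, so the columns arrive interleaved.
page = [
    (0, 20, 0, "The"), (100, 130, 0, "second"),
    (25, 50, 0, "first"), (135, 170, 0, "column"),
    (0, 30, 10, "column"), (100, 120, 10, "ends"),
    (35, 60, 10, "runs"), (125, 150, 10, "here"),
]
print(rebuild_text(page, column_split_x=90))
```

Real pages are harder: the column boundary has to be detected, the
baselines of a line rarely match exactly, and a suitable gap threshold
depends on the font size.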

Fortunately, most PDF files don't have all these challenges.  But these
issues plague any software that processes a large corpus.
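
For issue e), the page numbers and headers/footers, one common
heuristic is to drop a first or last line that recurs on most pages
once its digits are masked.  The sketch below assumes each page has
already been converted to a list of text lines; the function name
strip_running_lines and the min_share threshold are hypothetical, and
the heuristic will also remove a genuine body line that happens to
match a recurring pattern.

```python
import re
from collections import Counter


def strip_running_lines(pages, min_share=0.6):
    """Remove first/last lines that recur on at least min_share of pages.

    pages: list of pages, each a list of text lines (assumed input format).
    """
    def key(line):
        # Mask digits so that "12" and "13" count as the same line.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        if page:
            # Only the first and last line of a page are candidates.
            counts.update({key(page[0]), key(page[-1])})

    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        body = list(page)
        if body and key(body[0]) in repeated:
            body = body[1:]
        if body and key(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


pages = [
    ["Chapter 1  Cells", "Cells are small.", "12"],
    ["Chapter 1  Cells", "They divide.", "13"],
    ["Chapter 1  Cells", "The end.", "14"],
]
print(strip_running_lines(pages))
```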

John Sowa
