[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"
John F Sowa
sowa at bestweb.net
Mon Oct 15 13:14:44 UTC 2012
On 10/12/2012 6:28 AM, Krishnamurthy, Ramesh wrote:
> I know there are various problems:
>
> a) graphic PDFs rather than text PDFs - eg when people have scanned older texts
> that were not created/available as digitized text?
>
> b) columnar layout
>
> c) embedded graphics, eg photos, diagrams, graphs
>
> d) software that can only process one page at a time, or outputs one file per page
>
> e) minor irritations, such as page numbers and headers/footers that need to be edited out
On 10/12/2012 6:53 AM, Martin Reynaert wrote:
> The e) issue you raise is not minor when building a large corpus...
These issues can become nightmares in some cases. PostScript and PDF
allow blocks of text and graphics to be inserted into a page at any
location and in any order. Anyone who tries to analyze the PDF source
to extract a linear sequence of text may encounter serious obstacles:
1. In generating multi-column text, some formatters generate the page
one line at a time, starting from the top. The linear sequence
in the PDF file will contain all the columns interleaved.
2. To justify text, some formatters do not insert spaces of various
width into the text. Instead, they just calculate where each word
should go and place it there directly. As a result, the string
of text does not contain any blanks between words.
3. For very large fonts in titles and headings, some formatters
generate two lines of special characters -- one for the tops
of the letters and one for the bottoms.
4. Because of obstacles #1, #2, and #3 (and others), some PDF-to-text
analyzers generate an intermediate file in print format and use
OCR to translate it to text. But no OCR tool is perfect.
5. Among the problems with OCR are changes in fonts, changes from
roman to italic to bold to bold italic, etc. Letters with umlauts
and accents create problems, especially with characters used in
less common languages. Superscripts and subscripts are frequently
mangled. Mathematical formulas are almost always mangled.
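As a rough illustration of obstacles #1 and #2, here is a minimal
Python sketch that rebuilds reading order and word spacing from
positioned fragments. It assumes that some extraction tool has already
supplied each fragment's coordinates, that the page has exactly two
columns, and that the x coordinate separating them is known. The names
Frag, linearize, column_split, gap, and line_tol are hypothetical and
are not taken from any particular library.

```python
from collections import defaultdict
from typing import NamedTuple


class Frag(NamedTuple):
    """A positioned text fragment; the field names are illustrative."""
    x0: float   # left edge
    x1: float   # right edge
    y: float    # vertical position, increasing down the page
    text: str


def linearize(frags, column_split, gap=1.0, line_tol=2.0):
    """Return page text in reading order: left column first, then right.

    column_split: x coordinate separating the two columns (assumed known).
    gap: horizontal distance above which a space is inserted.
    line_tol: bucket size used to group fragments into lines by y.
    """
    columns = defaultdict(list)
    for f in frags:
        columns[0 if f.x0 < column_split else 1].append(f)

    out = []
    for col in sorted(columns):
        # Group fragments into lines by bucketing y. This is a
        # simplification: values near a bucket edge can be split.
        lines = defaultdict(list)
        for f in columns[col]:
            lines[round(f.y / line_tol)].append(f)
        for key in sorted(lines):
            row = sorted(lines[key], key=lambda f: f.x0)
            parts = [row[0].text]
            for prev, cur in zip(row, row[1:]):
                # Obstacle #2: no blanks in the stream, so infer them
                # from the horizontal gap between fragments.
                if cur.x0 - prev.x1 > gap:
                    parts.append(" ")
                parts.append(cur.text)
            out.append("".join(parts))
    return "\n".join(out)


# Obstacle #1: fragments arrive line by line, with columns interleaved.
page = [
    Frag(50, 70, 100, "The"), Frag(73, 100, 100, "cat"),
    Frag(300, 330, 100, "Dogs"), Frag(333, 360, 100, "bark"),
    Frag(50, 70, 112, "sat"), Frag(73, 100, 112, "down"),
    Frag(300, 330, 112, "at"), Frag(333, 360, 112, "night"),
]
print(linearize(page, column_split=200))
```

With the sample data, the left column ("The cat" / "sat down") is
printed before the right column ("Dogs bark" / "at night"). Real pages
need the column boundary and the gap threshold to be estimated from
the data, which is where most of the difficulty lies.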
Fortunately, most PDF files don't have all these challenges. But these
issues plague any software that processes a large corpus.
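For issue e) at corpus scale, one common heuristic is to drop lines
that recur at the top or bottom of many pages. The Python sketch below
assumes the text is already split into pages and lines; the names
strip_running_lines, edge, and min_share are hypothetical, and the
thresholds are guesses that would need tuning for each collection.

```python
import re
from collections import Counter


def _norm(line):
    """Collapse digit runs so 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages, edge=1, min_share=0.5):
    """Drop top/bottom lines that recur on many pages.

    pages: list of pages, each a list of text lines.
    edge: how many lines at the top and bottom of each page to inspect.
    min_share: a line is dropped if it appears at a page edge on more
               than this fraction of pages.
    """
    counts = Counter()
    for lines in pages:
        # A set, so a line is counted at most once per page.
        counts.update({_norm(l) for l in lines[:edge] + lines[-edge:]})

    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n > threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= len(lines) - edge
            if at_edge and _norm(line) in repeated:
                continue  # running header, footer, or page number
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

This only looks at page edges, so a body line that happens to repeat
in the middle of a page is left alone. It will still misfire on short
documents, or when the first body line of many pages is identical.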
John Sowa