[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

John Williams j0hnwh0ever.corpora at gmail.com
Tue Oct 16 18:26:39 UTC 2012


Hi Ramesh,

I don't know if this is any good to you, but I've just written a
rough-and-ready bash shell script called add2corpus which will convert
batches of .doc, .docx, and .pdf files to plain text for use in a corpus.
For the pdf part I've used the pdftotext utility already mentioned in this
thread, which is not perfect, but there is no magic bullet for this task.
I have in mind, among other things, building corpora of student writing
from assignments submitted electronically or uploaded to TurnItIn (not
fully tested on the latter yet).
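
For anyone who wants the general shape of such a batch converter, here is a
minimal sketch in Python. It is not add2corpus itself (that is a bash
script, available from the link in this message), and everything in it is an
assumption made for illustration: the function names, the choice to read
.docx files with the standard library rather than an external tool, and the
choice to skip .doc files altogether. The only external program it calls is
pdftotext, with "-" as the output name so that the text goes to stdout.

```python
import shutil
import subprocess
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by the main document part of a .docx file
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Pull paragraph text out of a .docx using only the standard library."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def pdf_to_text(path):
    """Run pdftotext and return whatever it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def convert_batch(in_dir, out_dir):
    """Convert every .pdf and .docx in in_dir to a .txt file in out_dir.

    Returns two lists of file names: those converted and those skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, skipped = [], []
    for path in sorted(Path(in_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".docx":
            text = docx_to_text(path)
        elif suffix == ".pdf" and shutil.which("pdftotext"):
            text = pdf_to_text(path)
        else:
            # .doc, unknown file types, or pdftotext not installed
            skipped.append(path.name)
            continue
        # Keep the original extension in the name so that essay.pdf and
        # essay.docx do not overwrite each other's output
        (out_dir / (path.name + ".txt")).write_text(text, encoding="utf-8")
        converted.append(path.name)
    return converted, skipped
```

Calling convert_batch("submissions", "corpus") (both directory names are
made up) would leave one .txt file per converted document in "corpus" and
report which files were left alone, so that the skipped ones can be handled
by hand or with another tool.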

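I will look with interest at the other pdf converters mentioned here, to
see if I can improve the script. pdftotext can handle some, but not all,
scanned material; as far as I know it only extracts text that is already
embedded in the file and does no OCR itself.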
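
One way to spot the scanned material that did not come through is to look
for pages with almost no text in the converted output. The sketch below
does that in Python. It relies on pdftotext ending each page with a form
feed character (its default behaviour unless page breaks are switched off);
the function name and the threshold of 50 characters are assumptions picked
for illustration, not values taken from add2corpus.

```python
def sparse_pages(text, min_chars=50):
    """Return the 1-based numbers of pages with very little extracted text.

    text is the output of pdftotext, in which each page ends with a form
    feed. min_chars is an arbitrary threshold (an assumption): pages with
    fewer non-whitespace characters than this are reported.
    """
    pages = text.split("\f")
    # The form feed after the last page leaves an empty trailing chunk
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    return [
        number
        for number, page in enumerate(pages, start=1)
        if sum(not ch.isspace() for ch in page) < min_chars
    ]
```

A page that yields only a page number or nothing at all is a likely
candidate for an image-only scan, and can be set aside for an OCR tool or
for manual checking. Short but genuine pages (a title page, say) will be
flagged as well, so the result is a list of things to look at, not a
verdict.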

add2corpus, together with the associated README, is available from here:
http://bit.ly/jw-public
(Expert programmers reading this, please bear in mind I am not an expert
programmer!)

Please copy any relevant replies to john.x.williams -at- port.ac.uk, as I
read this list only occasionally these days.

add2corpus will be making its official debut at Corpus Linguistics in the
South at the University of Portsmouth on November 12th.

Best wishes,

j0hn

----

John Williams
Lecturer in Language & Linguistics
University of Portsmouth
http://www.port.ac.uk/departments/academic/slas/staff/title,123663,en.html




2012/10/15 John F Sowa <sowa at bestweb.net>

> On 10/12/2012 6:28 AM, Krishnamurthy, Ramesh wrote:
>
>> I know there are various problems:
>>
>> a) graphic PDFs rather than text PDFs - eg when people have scanned
>> older texts that were not created/available as digitized text?
>>
>> b) columnar layout
>>
>> c) embedded graphics, eg photos, diagrams, graphs
>>
>> d) software that can only process one page at a time, or outputs one
>> file per page
>>
>> e) minor irritations, such as page numbers and headers/footers that
>> need to be edited out
>>
>
> On 10/12/2012 6:53 AM, Martin Reynaert wrote:
>
>> The e) issue you raise is not minor when building a large corpus...
>>
>
> These issues can become nightmares in some cases.  Postscript and PDF
> allow blocks of text and graphics to be inserted into a page at any
> location and in any order.  Anyone who tries to analyze the PDF source
> to extract a linear sequence of text may encounter serious obstacles:
>
>  1. In generating multi-column text, some formatters generate the page
>     one line at a time, starting from the top.  The linear sequence
>     in the PDF file will contain all the columns interleaved.
>
>  2. To justify text, some formatters do not insert spaces of various
>     width into the text.  Instead, they just calculate where each word
>     should go and place it there directly.  As a result, the string
>     of text does not contain any blanks between words.
>
>  3. For very large fonts in titles and headings, some formatters
>     generate two lines of special characters -- one for the tops
>     of the letters and one for the bottoms.
>
>  4. Because of obstacles #1, #2, and #3 (and others), some PDF to text
>     analyzers generate an intermediate file in print format and use
>     OCR to translate it to text.  But no OCR tool is perfect.
>
>  5. Among the problems with OCR are changes in fonts, changes from
>     roman to italic to bold to bold italic, etc.  Letters with umlauts
>     and accents create problems, especially with characters used in
>     less common languages.  Superscripts and subscripts are frequently
>     mangled.  Mathematical formulas are almost always mangled.
>
> Fortunately, most PDF files don't have all these challenges.  But these
> issues plague any software that processes a large corpus.
>
> John Sowa
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>