<div>Hi Ramesh,<br></div><div><br></div><div>I don't know whether this is any use to you, but I've just written a rough-and-ready bash shell script called add2corpus that converts batches of .doc, .docx, and .pdf files to plain text for use in a corpus. For the PDF part I've used the pdftotext utility already mentioned in this thread, which is not perfect, but there is no magic bullet for this task. Among other things, I have in mind building corpora of student writing from assignments submitted electronically or uploaded to TurnItIn (not yet fully tested on the latter).</div>
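<div><br></div><div>For illustration only: the Python sketch below is not add2corpus itself, just one way a batch converter of this kind might be laid out. It assumes pdftotext is installed and on the PATH, uses its -layout option, and reads .docx files with the standard library instead of an external tool. The function names (docx_to_text, pdf_command, convert_batch) are hypothetical, and legacy .doc files are skipped because they need a separate converter.</div>
<pre>
```python
import subprocess
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by WordprocessingML elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Extract paragraph text from a .docx using only the standard library."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more w:t runs.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def pdf_command(src, dst):
    """Command line for pdftotext; -layout tries to keep the page layout."""
    return ["pdftotext", "-layout", str(src), str(dst)]


def convert_batch(in_dir, out_dir):
    """Convert every .pdf and .docx in in_dir to a .txt file in out_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.iterdir()):
        dst = out_dir / (src.stem + ".txt")
        suffix = src.suffix.lower()
        if suffix == ".pdf":
            # Assumption: pdftotext is available on the PATH.
            subprocess.run(pdf_command(src, dst), check=True)
        elif suffix == ".docx":
            dst.write_text(docx_to_text(src), encoding="utf-8")
        else:
            continue  # e.g. legacy .doc, which needs an external converter
        written.append(dst)
    return written
```
</pre>
<div>Calling convert_batch("submissions", "corpus") would then write one .txt file per input file it could handle and return the list of files written; both directory names are placeholders.</div>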

<div><br></div><div>I will look with interest at the other pdf converters mentioned here, to see if I can improve the script. pdftotext can handle some, but not all, scanned material.</div><div><br></div><div>add2corpus, together with the associated README, is available from here: <a href="http://bit.ly/jw-public">http://bit.ly/jw-public</a> </div>

<div>(Expert programmers reading this, please bear in mind that I am not an expert programmer!)</div><div><br></div><div>Please copy any relevant replies to john.x.williams -at- <a href="http://port.ac.uk">port.ac.uk</a>, as I read this list only occasionally these days.</div>

<div><br></div><div>add2corpus will be making its official debut at Corpus Linguistics in the South at the University of Portsmouth on November 12th.</div><div><br></div><div>Best wishes,</div><div><br></div><div>j0hn</div>

<div><br></div><div>----</div><div><br></div><div>John Williams</div><div>Lecturer in Language &amp; Linguistics</div><div>University of Portsmouth</div><div><a href="http://www.port.ac.uk/departments/academic/slas/staff/title,123663,en.html">http://www.port.ac.uk/departments/academic/slas/staff/title,123663,en.html</a></div>

<div><br></div><div><br></div><div><br></div><div><br></div><div class="gmail_quote">2012/10/15 John F Sowa <span dir="ltr">&lt;<a href="mailto:sowa@bestweb.net" target="_blank">sowa@bestweb.net</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On 10/12/2012 6:28 AM, Krishnamurthy, Ramesh wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I know there are various problems:<br>

<br>

a) graphic PDFs rather than text PDFs - eg when people have scanned older texts<br>

that were not created/available as digitized text?<br>

<br>

b) columnar layout<br>

<br>

c) embedded graphics, eg photos, diagrams, graphs<br>

<br>

d) software that can only process one page at a time, or outputs one file per page<br>

<br>

e) minor irritations, such as page numbers and headers/footers that need to be edited out<br>

</blockquote>

<br></div><div class="im">

On 10/12/2012 6:53 AM, Martin Reynaert wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The e) issue you raise is not minor when building a large corpus...<br>

</blockquote>

<br></div>
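<br>
As a hedged illustration of issue e) above (not code from anyone in this thread), the Python sketch below drops bare page numbers and lines that recur at the top or bottom of many pages. It assumes the input is plain text in which pages are separated by form feed characters, which is what pdftotext emits by default. The function name strip_page_furniture and the min_share threshold are hypothetical, and the heuristic will also delete a genuine body line that happens to consist only of a number.<br>
<pre>
```python
import re
from collections import Counter


def strip_page_furniture(text, min_share=0.5):
    """Remove bare page numbers and lines repeated at page edges.

    Pages are assumed to be separated by form feeds ("\f").
    """
    pages = [p.splitlines() for p in text.split("\f") if p.strip()]

    # Count how often each line appears as the first or last
    # non-blank line of a page.
    edge_counts = Counter()
    for lines in pages:
        body = [line.strip() for line in lines if line.strip()]
        edge_counts.update({body[0], body[-1]})

    # A line is treated as a running header/footer if it sits at the edge
    # of at least min_share of the pages (and of at least two pages).
    threshold = max(2, min_share * len(pages))
    furniture = {line for line, n in edge_counts.items() if n >= threshold}

    page_number = re.compile(r"^(page\s+)?\d+$", re.IGNORECASE)

    cleaned = []
    for lines in pages:
        kept = [
            line for line in lines
            if line.strip() not in furniture
            and not page_number.match(line.strip())
        ]
        cleaned.append("\n".join(kept).strip("\n"))
    return "\n".join(cleaned)
```
</pre>
On a large corpus the threshold would need tuning per source, since headers that alternate between odd and even pages each appear on only about half the pages.<br>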

These issues can become nightmares in some cases.  Postscript and PDF<br>

allow blocks of text and graphics to be inserted into a page at any<br>

location and in any order.  Anyone who tries to analyze the PDF source<br>

to extract a linear sequence of text may encounter serious obstacles:<br>

<br>

 1. In generating multi-column text, some formatters generate the page<br>

    one line at a time, starting from the top.  The linear sequence<br>

    in the PDF file will contain all the columns interleaved.<br>

<br>

 2. To justify text, some formatters do not insert spaces of various<br>

    width into the text.  Instead, they just calculate where each word<br>

    should go and place it there directly.  As a result, the string<br>

    of text does not contain any blanks between words.<br>

<br>

 3. For very large fonts in titles and headings, some formatters<br>

    generate two lines of special characters -- one for the tops<br>

    of the letters and one for the bottoms.<br>

<br>

 4. Because of obstacles #1, #2, and #3 (and others), some PDF to text<br>

    analyzers generate an intermediate file in print format and use<br>

    OCR to translate it to text.  But no OCR tool is perfect.<br>

<br>

 5. Among the problems with OCR are changes in fonts, changes from<br>

    roman to italic to bold to bold italic, etc.  Letters with umlauts<br>

    and accents create problems, especially with characters used in<br>

    less common languages.  Superscripts and subscripts are frequently<br>

    mangled.  Mathematical formulas are almost always mangled.<br>

<br>
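<br>
Obstacles #1 and #2 both come from text being stored as positioned fragments. A minimal Python sketch of one possible workaround follows. It assumes some extractor has already supplied each fragment with its coordinates and that the page has exactly two columns divided at a known x position. The Word type, the function names, and the min_gap threshold are all hypothetical, and real pages need far more robust column detection.<br>
<pre>
```python
from collections import defaultdict
from typing import NamedTuple


class Word(NamedTuple):
    x0: float   # left edge of the fragment
    x1: float   # right edge of the fragment
    y: float    # baseline, measured from the top of the page
    text: str


def join_line(words, min_gap=1.0):
    """Join fragments on one line, adding a blank wherever there is a gap."""
    words = sorted(words, key=lambda w: w.x0)
    out = words[0].text
    for prev, cur in zip(words, words[1:]):
        out += (" " if cur.x0 - prev.x1 >= min_gap else "") + cur.text
    return out


def reading_order(words, column_split, min_gap=1.0):
    """Return the left column's lines top to bottom, then the right column's."""
    columns = (defaultdict(list), defaultdict(list))
    for w in words:
        # Index 0 is the left column, index 1 the right column.
        columns[int(w.x0 >= column_split)][round(w.y)].append(w)
    lines = []
    for col in columns:
        for y in sorted(col):
            lines.append(join_line(col[y], min_gap))
    return lines
```
</pre>
Fragments that arrive interleaved line by line across both columns come back grouped by column, with blanks restored from the horizontal gaps between them.<br>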

Fortunately, most PDF files don't have all these challenges.  But these<br>

issues plague any software that processes a large corpus.<span class="HOEnZb"><font color="#888888"><br>

<br>

John Sowa</font></span><div class="HOEnZb"><div class="h5"><br>

<br>


</div></div></blockquote></div><br>