Hi John, <br><br>Have you looked at Apache Tika (<a href="http://tika.apache.org">http://tika.apache.org</a>)? It is an open source library for extracting text and metadata from various formats, including PDF. <br><br>HTH<br>

<br>Julien Nioche<br>-- <br>DigitalPebble Ltd<br><br>Open Source Solutions for Text Engineering<br><a href="http://www.digitalpebble.com">http://www.digitalpebble.com</a><br><br><div class="gmail_quote">On 16 June 2010 11:40, John MCKENNY <span dir="ltr">&lt;<a href="mailto:john.mckenny@nottingham.edu.cn">john.mckenny@nottingham.edu.cn</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<div link="blue" vlink="purple" lang="EN-GB">


<div>


<p class="MsoNormal">Does anyone have a solution to the problem we are
facing in a corpus linguistic research project? We have been given permission
by the publishers and editors to download all issues of a journal from
the last 30 years obtainable from our university e-library in the form of PDFs
amounting to about 3,000,000 words. Starting with a small sample (250,000
words), we tried using various methods and software including Wordsmith
Tools 5 to convert the PDFs into text-only files. The result so far has
been text-only files with many words clumped together, e.g. ‘inthefinalanalysisitseems’.
Breaking up these clumps is a time-consuming business. For this reason, we haven’t
started compiling our larger corpus. We would only build the larger corpus if
there was some kind of automated or semi-automated way to generate text-only
files which contained all and only the alphanumeric sequences bounded by spaces
in the original PDFs, in other words, without clumps.</p>


<p class="MsoNormal">We would be very grateful for any suggestions you might have.</p>


<div>


<p class="MsoNormal">Best wishes</p>


<p class="MsoNormal"><span style="font-size: 10pt;">John McKenny<br>
Deputy Head of the Division of English Studies<br>
University of Nottingham Ningbo, China<br>
199 Taikang Dong Lu<br>
Ningbo, Zhejiang Province<br>
P.R. China 315100</span></p>


<p class="MsoNormal"><span style="font-size: 10pt;"><a href="mailto:john.mckenny@nottingham.edu.cn" target="_blank">john.mckenny@nottingham.edu.cn</a></span></p>




</div>




</div>




</div>


<br>_______________________________________________<br>

Corpora mailing list<br>

<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

<br></blockquote></div>
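If some clumps survive whichever extraction tool is used, one possible fallback, not proposed in this thread, is dictionary-based word segmentation. The following is only a minimal sketch in Python: <code>LEXICON</code> is a hypothetical toy word list and <code>segment</code> a hypothetical helper; a real run would need a full word list, and it would still mis-split ambiguous strings and fail on words missing from the list.<br>

```python
# Hypothetical toy lexicon for illustration; a real run would load a full word list.
LEXICON = {"in", "the", "final", "analysis", "it", "seems", "its", "is", "a", "an"}


def segment(clump, lexicon=LEXICON):
    """Split `clump` into lexicon words, preferring the fewest words.

    Returns a list of words, or None if the string cannot be fully
    covered by words from the lexicon.
    """
    max_len = max(len(word) for word in lexicon)
    n = len(clump)
    # best[i] holds the shortest word list covering clump[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Only look back as far as the longest lexicon word.
        for start in range(max(0, end - max_len), end):
            piece = clump[start:end]
            if best[start] is not None and piece in lexicon:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


if __name__ == "__main__":
    print(segment("inthefinalanalysisitseems"))
```

With the toy lexicon, <code>segment("inthefinalanalysisitseems")</code> returns the six words of the example clump. Strings that cannot be fully covered return <code>None</code>, so they can be set aside for manual checking.<br>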