Hi John, <br><br>Have you looked at Apache Tika (<a href="http://tika.apache.org">http://tika.apache.org</a>)? It is an open source library for extracting text and metadata from various formats, including PDF. <br><br>HTH<br>
<br>Julien Nioche<br>-- <br>DigitalPebble Ltd<br><br>Open Source Solutions for Text Engineering<br><a href="http://www.digitalpebble.com">http://www.digitalpebble.com</a><br><br><div class="gmail_quote">On 16 June 2010 11:40, John MCKENNY <span dir="ltr">&lt;<a href="mailto:john.mckenny@nottingham.edu.cn">john.mckenny@nottingham.edu.cn</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div link="blue" vlink="purple" lang="EN-GB">
<div>
<p class="MsoNormal">Does anyone have a solution to a problem we are
facing in a corpus linguistics research project? We have been given permission
by the publishers and editors to download all issues of a journal from
the last 30 years, obtainable from our university e-library as PDFs
and amounting to about 3,000,000 words. Starting with a small sample (250,000
words), we tried various methods and software, including WordSmith
Tools 5, to convert the PDFs into text-only files. The result so far has
been text-only files with many words clumped together, e.g. ‘inthefinalanalysisitseems’.
Breaking up these clumps is time-consuming, so we haven’t
started compiling our larger corpus. We would only build the larger corpus if
there were some automated or semi-automated way to generate text-only
files containing all and only the alphanumeric sequences bounded by spaces
in the original PDFs, in other words, without clumps.</p>
<p class="MsoNormal">We would be very grateful for any suggestions you might have.</p>
<div>
<p class="MsoNormal">Best wishes</p>
<p class="MsoNormal"><span style="font-size: 10pt;">John
McKenny<br>
Deputy Head of the Division of English Studies<br>
University of Nottingham Ningbo, China</span><span style="font-size: 10pt;"><br>
</span><span style="font-size: 10pt;">199
Taikang Dong Lu<br>
Ningbo, Zhejiang Province<br>
P.R.China 315100</span><span style="font-size: 10pt;"><br>
<br>
</span><span style="font-size: 10pt;"></span></p>
<p class="MsoNormal"><span style="font-size: 10pt;"><a href="mailto:john.mckenny@nottingham.edu.cn" target="_blank">john.mckenny@nottingham.edu.cn</a></span></p>
</div>
</div>
</div>
<br>_______________________________________________<br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br></blockquote></div><br><br clear="all"><br><br>
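One rough way to check whether the output of any extractor (Tika or otherwise) still contains clumps like the one quoted above is to flag unusually long alphabetic tokens. The sketch below is only an illustration under stated assumptions: the function name `find_clumps` and the 20-character threshold are hypothetical choices, not part of Tika or WordSmith Tools, and the threshold would need tuning for the journal's vocabulary.

```python
import re


def find_clumps(text, max_len=20):
    """Return alphabetic tokens longer than max_len characters.

    Assumption: genuine English words rarely exceed max_len letters,
    so longer runs are likely several words fused together.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [token for token in tokens if len(token) > max_len]


# The clump quoted in the original message is 25 letters long.
sample = "inthefinalanalysisitseems that the final analysis is clear"
print(find_clumps(sample))
```

Running this over each converted file and counting the flagged tokens would give a quick estimate of how clump-free a given conversion method is before committing to the full corpus.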