[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Wed Jun 16 11:03:30 UTC 2010

Hi John,

Have you looked at Apache Tika (http://tika.apache.org)? It is an open
source library for extracting text and metadata from various formats,
including PDF.
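
As a rough illustration only, the sketch below shows one way Tika's
command-line app could be scripted over a folder of PDFs, followed by a
quick check for leftover clumps. It assumes Java is installed and that the
tika-app jar has been downloaded locally; the jar path, the function names
and the 20-letter threshold are placeholders chosen for this example, not
part of Tika itself. The clump check is a crude heuristic and will also
flag some genuine long words.

```python
import re
import subprocess

# Assumption: location of a locally downloaded tika-app jar (placeholder name).
TIKA_JAR = "tika-app.jar"


def tika_command(pdf_path, jar=TIKA_JAR):
    """Build the command line asking the Tika app for plain-text output."""
    return ["java", "-jar", str(jar), "--encoding=UTF-8", "--text", str(pdf_path)]


def extract_text(pdf_path, jar=TIKA_JAR):
    """Run Tika on one PDF and return the extracted text as a string."""
    result = subprocess.run(tika_command(pdf_path, jar),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def suspected_clumps(text, max_len=20):
    """Return runs of letters longer than max_len (likely run-together words)."""
    return [run for run in re.findall(r"[A-Za-z]+", text) if len(run) > max_len]
```

Running extract_text() on a few PDFs from the 250,000-word sample and
passing the result to suspected_clumps() should give a quick indication of
whether the clumping persists before committing to the full corpus.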

HTH

Julien Nioche
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

On 16 June 2010 11:40, John MCKENNY <john.mckenny at nottingham.edu.cn> wrote:

> Does anyone have a solution to the problem we are facing in a corpus
> linguistic research project? We have been given permission by the publishers
> and editors to download all issues of a journal from the last 30 years
> obtainable from our university e-library in the form of PDFs amounting to
> about 3,000,000 words. Starting with a small sample (250,000 words), we
> tried using various methods and software, including WordSmith Tools 5, to
> convert the PDFs into text-only files. The result so far has been text-only
> files with many words clumped together, e.g. ‘inthefinalanalysisitseems’.
> Breaking up these clumps is a time-consuming business. For this reason, we
> haven’t started compiling our larger corpus. We would only build the larger
> corpus if there were some kind of automated or semi-automated way to generate
> text-only files which contained all and only the alphanumeric sequences
> bounded by spaces in the original PDFs, in other words, without clumps.
>
> We would be very grateful for any suggestions you might have.
>
> Best wishes
>
> John McKenny
> Deputy Head of the Division of English Studies
> University of Nottingham Ningbo, China
> 199 Taikang Dong Lu
> Ningbo, Zhejiang Province
> P.R.China   315100
>
>  john.mckenny at nottingham.edu.cn
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>