[Corpora-List] converting PDFs to ASCII or text-only files without clumps
Julien Nioche
lists.digitalpebble at gmail.com
Wed Jun 16 11:03:30 UTC 2010
Hi John,
Have you looked at Apache Tika (http://tika.apache.org)? It is an open
source library for extracting text and metadata from various formats,
including PDF.
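
A minimal sketch of how one might try it from Python, assuming the standalone Tika app jar has been downloaded and saved as tika-app.jar and that Java is on the path. The jar location, the helper names and the 20-character threshold are placeholders of mine, not anything prescribed by Tika. It calls the Tika command line with --text and then lists very long alphabetic tokens, so you can check on your small sample whether clumps such as 'inthefinalanalysisitseems' still turn up:

```python
import subprocess
from pathlib import Path


def tika_text_command(pdf_path, jar_path="tika-app.jar"):
    """Build the command that asks the Tika app jar for plain text.

    jar_path is an assumed location; point it at wherever the jar lives.
    """
    return ["java", "-jar", str(jar_path), "--text",
            "--encoding=UTF-8", str(pdf_path)]


def suspicious_tokens(text, max_len=20):
    """Return purely alphabetic tokens longer than max_len.

    This is only a rough heuristic for spotting clumped words; the
    threshold of 20 characters is an arbitrary choice.
    """
    return [tok for tok in text.split()
            if tok.isalpha() and len(tok) > max_len]


def convert(pdf_path, out_dir, jar_path="tika-app.jar"):
    """Extract text from one PDF, save it as .txt, and report likely clumps."""
    result = subprocess.run(tika_text_command(pdf_path, jar_path),
                            capture_output=True, check=True)
    text = result.stdout.decode("utf-8", errors="replace")
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path, suspicious_tokens(text)
```

Running convert() over the sample and looking at the returned token lists should show fairly quickly whether the output is clean enough to justify building the larger corpus. How well the spacing survives will depend on how the PDFs themselves were produced, so it is worth testing on issues from different years.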
HTH
Julien Nioche
--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
On 16 June 2010 11:40, John MCKENNY <john.mckenny at nottingham.edu.cn> wrote:
> Does anyone have a solution to the problem we are facing in a corpus
> linguistic research project? We have been given permission by the publishers
> and editors to download all issues of a journal from the last 30 years
> obtainable from our university e-library in the form of PDFs amounting to
> about 3,000,000 words. Starting with a small sample (250,000 words), we
> tried using various methods and software including Wordsmith Tools 5 to
> convert the PDFs into text-only files. The result so far has been text-only
> files with many words clumped together e.g. ‘inthefinalanalysisitseems’.
> Breaking up these clumps is a time-consuming business. For this reason, we
> haven’t started compiling our larger corpus. We would only build the larger
> corpus if there were some kind of automated or semi-automated way to generate
> text-only files which contained all and only the alphanumeric sequences
> bounded by spaces in the original PDFs, in other words, without clumps.
>
> We would be very grateful for any suggestions you might have.
>
> Best wishes
>
> John McKenny
> Deputy Head of the Division of English Studies
> University of Nottingham Ningbo, China
> 199 Taikang Dong Lu
> Ningbo, Zhejiang Province
> P.R.China 315100
>
> john.mckenny at nottingham.edu.cn
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora