[Corpora-List] converting PDFs to ASCII or text-only files without clumps
John MCKENNY
john.mckenny at nottingham.edu.cn
Wed Jun 16 10:40:01 UTC 2010
Does anyone have a solution to a problem we are facing in a corpus linguistics research project? The publishers and editors have given us permission to download all issues of a journal from the last 30 years. The issues are available from our university e-library as PDFs and amount to about 3,000,000 words.

Starting with a small sample (250,000 words), we tried various methods and software, including WordSmith Tools 5, to convert the PDFs into text-only files. So far the resulting files contain many words clumped together, e.g. 'inthefinalanalysisitseems'. Breaking up these clumps is time-consuming, so we have not yet started compiling the larger corpus. We would only build it if there were some automated or semi-automated way to generate text-only files containing all and only the alphanumeric sequences bounded by spaces in the original PDFs, in other words, files without clumps.
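To make concrete what a semi-automated de-clumping step might look like, here is a minimal Python sketch of dictionary-based segmentation. It is an illustration under stated assumptions, not a method tried in this project: the function names `segment` and `declump_line` and the tiny `SAMPLE_VOCABULARY` are hypothetical, and a real run would need a large word list (for example, one built from text known to be clean). Tokens with attached punctuation or words missing from the list are left unchanged, so the output would still need spot-checking.

```python
# Hypothetical demo word list; a real one would hold many thousands of words.
SAMPLE_VOCABULARY = {"in", "the", "final", "analysis", "it", "its", "seems", "is", "a"}


def segment(clump, vocabulary, max_word_len=20):
    """Split a clump into the fewest vocabulary words, or return None.

    Lookup is case-insensitive; the returned pieces keep the original casing.
    """
    n = len(clump)
    lowered = clump.lower()
    # best[i] is the shortest list of pieces covering clump[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            if best[start] is None or lowered[start:end] not in vocabulary:
                continue
            candidate = best[start] + [clump[start:end]]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


def declump_line(line, vocabulary):
    """Re-space one line, splitting only tokens that are not known words."""
    out = []
    for token in line.split():
        if token.lower() in vocabulary:
            out.append(token)
            continue
        pieces = segment(token, vocabulary)
        # Leave the token untouched when no full segmentation exists.
        out.extend(pieces if pieces is not None else [token])
    return " ".join(out)


if __name__ == "__main__":
    print(declump_line("inthefinalanalysisitseems", SAMPLE_VOCABULARY))
    # prints: in the final analysis it seems
```

Preferring the fewest words is a simple heuristic; ambiguous clumps may have several valid splits, in which case a frequency-weighted choice would probably do better.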
We would be very grateful for any suggestions you might have.
Best wishes
John McKenny
Deputy Head of the Division of English Studies
University of Nottingham Ningbo, China
199 Taikang Dong Lu
Ningbo, Zhejiang Province
P.R.China 315100
john.mckenny at nottingham.edu.cn