[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Wed Jun 16 14:07:08 UTC 2010

John, 

A tool we found useful in compiling MICUSP
(http://micusp.elicorpora.info/) is PDF to Word (http://www.pdftoword.com/),
a free online tool that converts PDFs into doc or rtf files. We used it
for files that Adobe Reader couldn’t convert, for example because they were
password protected. You then still have to open the output files in
Word and save them as text from there. For our purposes, going via Word
rather than straight to txt was the preferred option: you get an
editable version of the text that is quite close to the original (including
figures, tables, etc.), which makes it easier to insert gap tags or line
breaks in the right places.

Best of luck with the project!

Ute 

*********************************************************

Just launched: MICUSP Simple -- free search/browse interface to the Michigan
Corpus of Upper-level Student Papers (829 papers, around 2.6 million words):

http://search-micusp.elicorpora.info/simple/

Dr. Ute Römer
Director of the Applied Corpus Linguistics Unit
English Language Institute
University of Michigan
Email: uroemer at umich.edu
Fax: +1 734 763 0369
http://www.elicorpora.info
http://www.uteroemer.com

Surface mail address:
Dr. Ute Römer
University of Michigan
English Language Institute
500 E. Washington Street
Ann Arbor, MI 48104-2028
USA

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
John MCKENNY
Sent: Wednesday, June 16, 2010 6:40 AM
To: corpora at uib.no
Subject: [Corpora-List] converting PDFs to ASCII or text-only files without
clumps

Does anyone have a solution to the problem we are facing in a corpus
linguistic research project? We have been given permission by the publishers
and editors to download all issues of a journal from the last 30 years,
obtainable from our university e-library as PDFs amounting to
about 3,000,000 words. Starting with a small sample (250,000 words), we
tried various methods and software, including WordSmith Tools 5, to
convert the PDFs into text-only files. The result so far has been text-only
files with many words clumped together, e.g. ‘inthefinalanalysisitseems’.
Breaking up these clumps is time-consuming, so we
haven’t started compiling the larger corpus. We would only build it
if there were some automated or semi-automated way to generate
text-only files containing all and only the alphanumeric sequences
bounded by spaces in the original PDFs, in other words, without clumps.
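
One semi-automated way to break up clumps such as ‘inthefinalanalysisitseems’ is dictionary-based segmentation: any token not found in a word list is split into the fewest known words that cover it exactly, and tokens that cannot be split are left alone for manual checking. The sketch below is only an illustration of that idea, not a method used in this thread. The names `split_clump`, `repair_line` and `VOCABULARY` are hypothetical, the tiny vocabulary is a stand-in for a real word list, and punctuation attached to a token is not handled.

```python
# Hypothetical toy vocabulary; a real run would load a large word list.
VOCABULARY = {"in", "the", "final", "analysis", "it", "seems",
              "a", "an", "is", "its"}


def split_clump(token, vocabulary, max_word_len=20):
    """Return the fewest vocabulary words that spell out token, or None."""
    n = len(token)
    # best[i] is the shortest word list covering token[:i], or None if
    # no segmentation of that prefix exists.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def repair_line(line, vocabulary):
    """Split clumped tokens in a line; leave everything else untouched."""
    words = []
    for token in line.split():
        if token.lower() in vocabulary:
            # Already a known word: keep it, including its capitalisation.
            words.append(token)
            continue
        # Note: the split pieces come back lowercased.
        pieces = split_clump(token.lower(), vocabulary)
        # Unknown tokens that cannot be segmented are kept for manual review.
        words.extend(pieces if pieces else [token])
    return " ".join(words)


if __name__ == "__main__":
    print(repair_line("The result inthefinalanalysisitseems fine", VOCABULARY))
```

Preferring the fewest words is a simple heuristic that avoids splitting into many very short words; with a large vocabulary some clumps will still have several plausible readings, so the output would need spot-checking against the PDFs.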

We would be very grateful for any suggestions you might have.

Best wishes

John McKenny
Deputy Head of the Division of English Studies
University of Nottingham Ningbo, China
199 Taikang Dong Lu
Ningbo, Zhejiang Province
P.R.China   315100

john.mckenny at nottingham.edu.cn
