[Corpora-List] converting PDFs to ASCII or text-only files without clumps
Ute Römer
uroemer at umich.edu
Wed Jun 16 14:07:08 UTC 2010
John,
A tool we found useful in the compilation of MICUSP
(http://micusp.elicorpora.info/) is PDF to Word (http://www.pdftoword.com/),
a free online tool that turns PDFs into .doc or .rtf files. We used this tool
for files that Adobe Reader couldn't convert, for example if they were
password protected. You would then still have to open the output files in
Word and save them as text from there. For our purposes, going via Word
rather than straight to txt was the preferred option: that way you get an
editable version of the text that is quite close to the original (including
figures, tables, etc.), which makes it easier to insert gap tags or line
breaks in the right places.
Best of luck with the project!
Ute
*********************************************************
Just launched: MICUSP Simple -- free search/browse interface to the Michigan
Corpus of Upper-level Student Papers (829 papers, around 2.6 million words):
<http://search-micusp.elicorpora.info/simple/>
Dr. Ute Römer
Director of the Applied Corpus Linguistics Unit
English Language Institute
University of Michigan
Email: uroemer at umich.edu
Fax: +1 734 763 0369
http://www.elicorpora.info
http://www.uteroemer.com
Surface mail address:
Dr. Ute Römer
University of Michigan
English Language Institute
500 E. Washington Street
Ann Arbor, MI 48104-2028
USA
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
John MCKENNY
Sent: Wednesday, June 16, 2010 6:40 AM
To: corpora at uib.no
Subject: [Corpora-List] converting PDFs to ASCII or text-only files without
clumps
Does anyone have a solution to the problem we are facing in a corpus
linguistic research project? We have been given permission by the publishers
and editors to download all issues of a journal from the last 30 years
obtainable from our university e-library in the form of PDFs amounting to
about 3,000,000 words. Starting with a small sample (250,000 words), we
tried using various methods and software including Wordsmith Tools 5 to
convert the PDFs into text-only files. The result so far has been text-only
files with many words clumped together, e.g. inthefinalanalysisitseems.
Breaking up these clumps is a time-consuming business. For this reason, we
haven't started compiling our larger corpus. We would only build the larger
corpus if there was some kind of automated or semi-automated way to generate
text-only files which contained all and only the alphanumeric sequences
bounded by spaces in the original PDFs, in other words, without clumps.
We would be very grateful for any suggestions you might have.
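
For illustration only, here is a minimal sketch of what a semi-automated
declumping pass might look like. It is not a method used by either
correspondent. It assumes a word list is available (the toy VOCAB below is
hypothetical; a real run would load a large list, for instance from a
reference corpus), and the function names split_clump and declump are
invented for this example. Tokens not found in the word list are split by
dynamic programming into the fewest known words; tokens that cannot be
split are left untouched for manual review.

```python
def split_clump(token, vocabulary, max_word_len=20):
    """Split a run-together token into known words, or return None.

    Dynamic programming over prefixes; prefers the split with the fewest words.
    """
    n = len(token)
    # best[i] holds the shortest list of words covering token[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def declump(text, vocabulary):
    """Replace out-of-vocabulary tokens with a segmentation when one exists.

    Note: repaired tokens come back lowercased, and tokens carrying
    punctuation are left as they are; both are simplifications.
    """
    repaired = []
    for token in text.split():
        if token.lower() in vocabulary:
            repaired.append(token)
            continue
        words = split_clump(token.lower(), vocabulary)
        repaired.append(" ".join(words) if words else token)
    return " ".join(repaired)


# Hypothetical toy word list, just large enough for the example clump.
VOCAB = {"in", "the", "final", "analysis", "it", "seems", "its", "is"}
print(declump("inthefinalanalysisitseems", VOCAB))
# prints: in the final analysis it seems
```

A pass like this can guess wrong when a clump has several valid splits, so
its output would still need spot-checking against the original PDFs.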
Best wishes
John McKenny
Deputy Head of the Division of English Studies
University of Nottingham Ningbo, China
199 Taikang Dong Lu
Ningbo, Zhejiang Province
P.R.China 315100
john.mckenny at nottingham.edu.cn