[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Wed Jun 16 11:00:45 UTC 2010

Hello!

This solution might not help you at this stage, but in case you decide to
copy and paste text from an Adobe PDF file into a word processor, this
tool might be of interest to you:

AutoUnbreak 1.01: Simple application that removes line breaks from formatted
(or plain) text input

AutoUnbreak removes line breaks from formatted (or plain) text input. This
is useful if you want to reformat a text document, where lines have been cut
short, e.g. when copying text from an Adobe PDF file to a word processor.
Thus, the program removes any extraneous line breaks. AutoUnbreak removes
these carriage returns/line breaks selectively: for instance, it tries to
reconstruct hyphenated words, and it does not merge lines that are part of
a numbered or bulleted list. AutoUnbreak is
customizable and you can change its "rules" by altering the plain text files
"merge.set" and "exceptions.set" to fit your needs:
http://www.softpedia.com/get/Office-tools/Other-Office-Tools/AutoUnbreak.shtml
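
To make the behaviour described above more concrete, here is a minimal
Python sketch of the same idea: merge hard-wrapped lines, rejoin words
hyphenated across a line break, and leave numbered or bulleted list items
on their own lines. This is not AutoUnbreak's actual implementation; its
real rules live in "merge.set" and "exceptions.set", whose format is not
shown here. The function name unbreak and the LIST_ITEM pattern are
assumptions made purely for illustration.

```python
import re

# Hypothetical rule (not taken from AutoUnbreak): a line starting with
# "1.", "2)", "-", "*" or a bullet character begins a new list item.
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def unbreak(text: str) -> str:
    """Merge hard-wrapped lines; emit one paragraph or list item per line."""
    paragraphs = []
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if LIST_ITEM.match(line):
            # Never merge a list item into the preceding text.
            if current:
                paragraphs.append(current)
            current = line
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # Rejoin a word that was hyphenated at the line break.
            current = current[:-1] + line
        else:
            current = current + " " + line
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


sample = (
    "This is use-\nful if lines have been cut\nshort.\n\n"
    "1. first item\n2. second item\ncontinued here"
)
print(unbreak(sample))
```

Note that the hyphen rule is deliberately naive: a genuine compound such as
"well-known" split at the hyphen would be joined as "wellknown", which is
presumably the kind of case an exceptions list is meant to handle.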

Best,

Ana Rita Remígio

University of Aveiro

Portugal

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
John MCKENNY
Sent: Wednesday, 16 June 2010 11:40
To: corpora at uib.no
Subject: [Corpora-List] converting PDFs to ASCII or text-only files without
clumps

Does anyone have a solution to the problem we are facing in a corpus
linguistic research project? We have been given permission by the publishers
and editors to download all issues of a journal from the last 30 years
obtainable from our university e-library in the form of PDFs amounting to
about 3,000,000 words. Starting with a small sample (250,000 words), we
tried using various methods and software, including WordSmith Tools 5, to
convert the PDFs into text-only files. The result so far has been text-only
files with many words clumped together, e.g. ‘inthefinalanalysisitseems’.
Breaking up these clumps is a time-consuming business. For this reason, we
haven’t started compiling our larger corpus. We would only build the larger
corpus if there was some kind of automated or semi-automated way to generate
text-only files which contained all and only the alphanumeric sequences
bounded by spaces in the original PDFs, in other words, without clumps.

We would be very grateful for any suggestions you might have.

Best wishes

John McKenny
Deputy Head of the Division of English Studies
University of Nottingham Ningbo, China
199 Taikang Dong Lu
Ningbo, Zhejiang Province
P.R.China   315100

john.mckenny at nottingham.edu.cn
