Corpora: Converting PDF files

ramesh at clg.bham.ac.uk
Fri Dec 28 14:54:34 UTC 2001


Dear All

In May 2001, I asked:
I'm working on a PC with Windows95.
I have MSWord 2000, Acrobat Reader5, and GSview3.6.
Can anyone tell me if it is possible to convert
PDF files into ASCII or MSWord?
And how....

I received many helpful replies, and
promised to post a summary, but forgot.

A colleague has just asked me about the same problem,
which reminded me that I did not post the summary.

So here it is. Apologies to anyone I have
forgotten.

Best
Ramesh Krishnamurthy
Consultant: COBUILD, Collins Dictionaries.
Hon. Res. Fellow: University of Birmingham.
Hon. Res. Fellow: University of Wolverhampton.


1. Kevin McTait (UMIST):
try the auto-email service at:
http://www.pdfzone.com/services/access.html

2. Ha Le An (Wolverhampton Uni):
the simplest way is to select all, copy from Acrobat Reader, and paste into
Word, but there is no way to keep the formatting, images, tables, etc.

3. Fabio Tamburini (Bologna):
Open the file with GhostView, then choose menu EDIT, then "Text
Extract..." and an ASCII text file will be produced...
Pay attention to the formatting of the new file! ;-)
I have GSview3.3, but this feature should also be available in 3.6...

4. Mike Scott (Liverpool):
Adobe Acrobat, the full version, not just the Reader,
will export to various formats, haven't checked
them all yet though.

5. Chris Tribble (Sri Lanka):
I do this with the full Acrobat - I use version 4.  This has a text
selection tool.  Once you've clicked on this you can use Ctrl A to select
all text in the document if you've selected View, Continuous.  This text can
then be pasted to a notepad or word document.

6. Acrobat has an export to Postscript option. Then you can use a
`postscript-to-text' converter.

7. Everita Milconoka (Latvia):
You may try to send your .pdf file to
access-b at Adobe.COM
and then in subject line you have to write either pdf2txt or pdf2htm,
and after some minutes they will send you back the file in .txt or .htm
format.

8. Steven Krauwer (Netherlands):
Adobe offers on-line and email facilities for this
at http://access.adobe.com:80/simple_form.html

9. Philip Resnik (Maryland):
The solution was at
 http://www.research.compaq.com/SRC/virtualpaper/pstotext.html --
it seems to work very nicely for pdf2txt conversion at least
in the Unix version.

10. Simon G. J. Smith (Birmingham):
MSWord -- www.adobe.com will do free conversions FROM Word (they get emailed
back to you, and you can only do about 5 per email address), but I don't know about the other way round.
To extract text from acrobat (mine is 4.0) choose the text select tool (capital T with a little
 box). Then just cut and paste the text you want. This works one page at a time.
From ghostview (if it can read your particular PDF, sometimes doesn't work for
 me), do the whole thing at once by Edit|Text Extract. It's in the gsview help.
You can convert whole pages to bitmaps with gsview, and I think in Acrobat you
can select graphics from the pdf file (the Acrobat help says use the graphics select
tool, but I can't find this tool). The bitmap file can then be viewed from Word.

14. Jerome Richalot (Lyon):
Acrobat 5 apparently makes the whole difference. You can
download a plug-in from adobe.com called Access and add it on Acrobat to
convert from pdf to rtf.
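
For anyone with a whole folder of PDF files to convert, the pstotext route in
reply 9 can be scripted. What follows is only a minimal sketch in Python, not
something any of the respondents supplied. It assumes that pstotext (and the
Ghostscript it relies on) is installed and on the PATH, and that it writes the
extracted text to standard output when given a file name. The function names
are my own, purely for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pstotext_command(pdf_path, tool="pstotext"):
    """Build the argument list for one conversion (text is read from stdout)."""
    return [tool, str(pdf_path)]


def text_path_for(pdf_path):
    """Name the output file after the input: paper.pdf -> paper.txt."""
    return Path(pdf_path).with_suffix(".txt")


def convert_folder(folder, tool="pstotext"):
    """Convert every *.pdf in `folder`; return the list of text files written."""
    # Assumption: the converter is reachable by name on the PATH.
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} not found on PATH")
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if the tool reports a failure.
        result = subprocess.run(
            pstotext_command(pdf, tool), capture_output=True, check=True
        )
        out = text_path_for(pdf)
        # Write the raw bytes so no character encoding has to be guessed.
        out.write_bytes(result.stdout)
        written.append(out)
    return written
```

Calling convert_folder("my_pdfs") would then leave one .txt file beside each
PDF. As Fabio says in reply 3, pay attention to the formatting of the new
files: line breaks, columns and tables may not survive the conversion.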


