Corpora: Converting PDF files

Damon Allen Davison davison at socal.rr.com
Sat Dec 29 01:52:12 UTC 2001


A lovely summary with many useful links.

Slightly tangential to this discussion, I wanted to make a general remark
on conversion, especially from image formats.

I wanted to underline that the current (5.0) full version of Adobe
Acrobat can convert PDF files to RTF (without a plugin, actually), which
most word processors can open.  How well this works, however, depends on
the kind of PDF you are dealing with.  If the file contains actual encoded
text, then there is no problem, and there are many tools for extracting
it, as the summary has shown.  On the other hand, if the text in the PDF
file is actually an image (bitmap), then you would have to extract the
images into TIFF format (or some other losslessly compressed format) and
run them through OCR software, preferably with a more robust consumer
product like Caere/ScanSoft's OmniPage Pro 11.  All of this has to be
done on the Windows platform, since there are no comparable products for
Linux (there never were any) or Macintosh (discontinued).
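
The distinction between the two kinds of PDF can be checked roughly
before choosing a route.  What follows is only a minimal sketch in
Python, assuming an older PDF whose font and image dictionaries appear
uncompressed in the file; it scans the raw bytes rather than parsing the
document, and all of the names in it are hypothetical.

```python
import re

# Heuristic byte patterns (an assumption of this sketch, not a PDF parser):
# pages with real text reference font objects; scanned pages embed images.
FONT_MARKER = re.compile(rb"/Type\s*/Font\b")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess whether raw PDF bytes hold text, page images, both, or neither."""
    has_font = FONT_MARKER.search(data) is not None
    has_image = IMAGE_MARKER.search(data) is not None
    if has_font and has_image:
        return "mixed"
    if has_font:
        return "text"
    if has_image:
        return "image"
    return "unknown"


def classify_pdf_file(path: str) -> str:
    """Read a file from disk and classify it (hypothetical helper)."""
    with open(path, "rb") as handle:
        return classify_pdf_bytes(handle.read())


# Suggested next step for each outcome, following the advice above.
NEXT_STEP = {
    "text": "convert directly, e.g. save as RTF or use a text extractor",
    "image": "extract the page images to TIFF and run them through OCR",
    "mixed": "try direct conversion first, then OCR pages that come out empty",
    "unknown": "inspect by hand; the file may use compressed objects",
}
```

Something like NEXT_STEP[classify_pdf_file("scan.pdf")] would then
suggest a route, where "scan.pdf" is a placeholder file name.  A result
of "mixed" is what one would expect from a scanned file that already
carries an OCR text layer, and the guess should be confirmed by trying
to select text in a viewer.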

I have had to go through a similar process using texts from the
Bibliothèque Nationale de France's Gallica project.  Fortunately, many
of the texts there have already been OCR'd, making the process a lot
easier.

Warmest Regards,


Damon Allen Davison

On Fri, 2001-12-28 at 06:54, ramesh at clg2.bham.ac.uk wrote:
> 
> Dear All
> 
> In May 2001, I asked:
> I'm working on a PC with Windows95.
> I have MSWord 2000, Acrobat Reader5, and GSview3.6.
> Can anyone tell me if it is possible to convert
> PDF files into ASCII or MSWord?
> And how....
> 
> I received many helpful replies, and
> promised to post a summary, but forgot.
> 
[...]
> 
-- 
Damon Allen Davison
mailto:davison at socal.rr.com
