Corpora: Converting PDF files

Fri Dec 28 19:27:02 UTC 2001

Have any of you tried the suite of PDF translation products offered by BCL
computers? Check:

http://www.bcl-computers.com/

Regards,
   Michael O'Connell

On Fri, 28 Dec 2001, Tolkin, Steve wrote:

> Oh, I wish it were so easy!
>
> Summary:
> I believe there are several problems that affect all the approaches.
> 1. Ligatures e.g. fi, ff, ffi, Fi, etc. are emitted as special
> control characters, e.g. the single character ^L.
> 2. Words that had a hyphen introduced due to a line ending
> are emitted in two pieces.
>
> Details:
> 1. Just as an example here is the last part of page 7 of
> http://www.cs.columbia.edu/~min/papers/cucs-002-01.pdf
> that I created by copying with the text tool and then pasting into my
> editor (emacs).  Note that I have replaced the actual single
> characters ^L and ^K by a two character pair so you would see them in
> this email.  The original file contained a single character ^L (aka
> Control-l, C-l, octal 014, hexadecimal 0xc etc.)  Note also that ^L is
> used for two different purposes: for the ligature fi and to denote a
> page break.  ^K is used for "ff".
>
> <quote>
> The relative di^Kerence between these features across headers within a
> document seems to dictate their nesting depth. Header thus computes
> its ^Lnal feature set based on the di^Kerences in the values of these
> initial features in adjacent headers, shown in Table 3. This
> corresponds to learning whether one header dominates, is dominated by,
> or is on parity with an adjacent header. These pairwise features are
> Header's output and are passed on to the Combiner ^Lnal machine
> learning module.
> 7
> ^L
> </quote>
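
As a rough illustration of the substitution described above, here is a minimal Python sketch. It assumes the mapping seen in this one sample (^K for "ff", ^L for "fi"), which may not hold for other documents, and it assumes that a control character touching a letter is a ligature while one standing alone is a page break. The names LIGATURES and expand_ligatures are made up for this example.

```python
import re

# Assumed mapping, taken from the single sample above; other PDFs may differ.
LIGATURES = {"\x0b": "ff", "\x0c": "fi"}   # ^K and ^L

# A control character with a letter directly before or after it is treated
# as a ligature. One that stands alone (e.g. on its own line) is left as is,
# on the assumption that it marks a page break.
LIGATURE_RE = re.compile("(?<=[A-Za-z])[\x0b\x0c]|[\x0b\x0c](?=[A-Za-z])")


def expand_ligatures(text, mapping=LIGATURES):
    """Replace in-word control characters with the ligature they stand for."""
    return LIGATURE_RE.sub(lambda m: mapping[m.group(0)], text)


if __name__ == "__main__":
    sample = "its \x0cnal feature set based on the di\x0berences\n7\n\x0c\n"
    print(repr(expand_ligatures(sample)))
```

The trailing form feed in the sample is not touched, because it has no letter next to it.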
> Unfortunately the approach of having the file read by
> Ghostview (and processed by Ghostscript) is even worse.
> All the above errors appear, as well as another kind of error where it
> cannot read the contents due to some font problem or other issue,
> and so uses ### instead, e.g. the last sentence becomes:
> <quote>
> These pairwise features are ######'s output and are
> passed on to the ######## ^Lnal machine learning module.
> </quote>
>
> Unfortunately there are many more ligatures than this, e.g. fl,
> including some with three letters: ffi, etc.  They also
> can occur anywhere in a word, e.g. specific became "speci^Lc".
>
> I seem to recall that the particular assignments used by Acrobat,
> i.e. which control code is used for which ligature,
> vary.  (If anyone could provide more information about
> this I would appreciate it.)
>
> Assuming you have a big dictionary this problem can be
> partially remedied as follows:
> Find all words containing a ligature and scan the text
> looking for the assignment (i.e. on a per document level).
> Then fix them using the inferred mapping.
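
One possible reading of the dictionary idea above, as a hedged Python sketch: for every token that contains exactly one kind of control character, try each candidate ligature and count which substitution yields a dictionary word. The candidate list, the function name infer_ligature_map, and the use of a plain Python set as the dictionary are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed candidate ligatures; extend as needed.
CANDIDATES = ["ffi", "ffl", "ff", "fi", "fl"]

# Runs of letters and control characters (tab, newline and CR excluded).
TOKEN_RE = re.compile(r"[A-Za-z\x01-\x08\x0b\x0c\x0e-\x1f]+")


def infer_ligature_map(text, dictionary, candidates=CANDIDATES):
    """Guess, for one document, which control character means which ligature."""
    votes = {}
    for token in TOKEN_RE.findall(text):
        codes = {c for c in token if ord(c) < 32}
        # Skip plain words, tokens mixing several codes, and bare codes
        # (a control character with no letters may be a page break).
        if len(codes) != 1 or not any(c.isalpha() for c in token):
            continue
        code = codes.pop()
        for lig in candidates:
            if token.replace(code, lig).lower() in dictionary:
                votes.setdefault(code, Counter())[lig] += 1
    # Keep the most frequently confirmed ligature for each control character.
    return {code: counts.most_common(1)[0][0] for code, counts in votes.items()}


if __name__ == "__main__":
    words = {"difference", "final", "specific"}
    text = "the di\x0berence in the \x0cnal speci\x0cc result\n\x0c\n"
    print(infer_ligature_map(text, words))
```

A control character that never produces a dictionary word is simply absent from the result, so the caller can leave it untouched rather than guess.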
>
> Aside: This is similar to the problem with ligatures in *.ps files
> which the ps2text program tries to fix, e.g. here is an excerpt:
> <quote>
> #
> #  Process the filtered PostScript with $ps2txt_cmd and clean up its output.
> #  Substitute \ddd characters with correct combinations.
> #
> open(PS2TXT, "$ps2txt_cmd $dviflag < $tmpfile |") || die "Cannot run ps2txt";
> while (<PS2TXT>) {
> 	next if (/^\n/o);
> 	chop;
> 	if (/^.*\\.*$/o) {
> 		s/\\214/fi/g;
> 		s/\\256/fi/g;
> 		s/\\257/fl/g;
> 		s/\\320//g;
> </quote>
>
> 2. When converting an Adobe Acrobat *.pdf file to text
> there are often many hyphenated words.
> Here is an example from p. 11 of the same document as above.
> <quote>
> To further analyze CLASP's performance,
> we assess the features used by Ripper, since it implicitly does feature
> selec-
> tion when constructing its hypothesis.
> </quote>
>
> In certain cases the frequency of hyphenated words is very high.
> For example the U.S. IRS presents its publications
> using 3 columns, and so there are many hyphenated words introduced.
>
> Assuming you have a big dictionary this problem can be
> partially remedied as follows:
> If removing the hyphen produces a word, and neither fragment
> is a word then we simply store the word, e.g.
> "ap-propriate" becomes "appropriate".
> My coinage for this process: "dehyphenization".
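
The rule above could be sketched in Python roughly as follows. This is only an illustration: the dictionary is assumed to be a set of lower-case words, the name dehyphenate is invented here, and the pattern treats a hyphen followed by optional whitespace (including a line break) as a possible split point.

```python
import re

# Two letter runs separated by a hyphen and optional whitespace or newline.
HYPHEN_BREAK = re.compile(r"([A-Za-z]+)-\s*([A-Za-z]+)")


def dehyphenate(text, dictionary):
    """Join hyphen-split fragments when only the joined form is a known word."""
    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if (joined.lower() in dictionary
                and left.lower() not in dictionary
                and right.lower() not in dictionary):
            return joined
        # Otherwise keep the original text, hyphen and line break included.
        return match.group(0)
    return HYPHEN_BREAK.sub(join, text)


if __name__ == "__main__":
    words = {"selection", "appropriate", "well", "known"}
    print(dehyphenate("feature\nselec-\ntion when constructing", words))
    print(dehyphenate("an ap-propriate and well-known method", words))
```

Because the rule requires that neither fragment be a word, a split such as "in-formation" is left alone, since "in" is itself a word. That keeps genuine compounds like "well-known" intact at the cost of missing some line-end hyphens.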
>
> Requests for Additional information:
>
> If anyone has tools, e.g. in perl, to perform either of the
> fix up workarounds above I would like to know about them.
>
> It may be that these problems can be minimized by
> the use of some options when creating the *.pdf file.
> If so I would like to learn about that.  (But I believe
> once the file is created you are stuck.)
>
> Google seems to have a decent *.pdf to *.html convertor
> and I would be interested in any information about that.
>
>
> Hopefully helpfully yours,
> Steve
> --
> Steven Tolkin          steve.tolkin at fmr.com      617-563-0516
> Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
> There is nothing so practical as a good theory.  Comments are by me,
> not Fidelity Investments, its subsidiaries or affiliates.
>
> > -----Original Message-----
> > From: ramesh at clg2.bham.ac.uk [mailto:ramesh at clg2.bham.ac.uk]
> > Sent: Friday, December 28, 2001 9:55 AM
> > To: corpora at hd.uib.no
> > Subject: Corpora: Converting PDF files
> >
> >
> >
> > Dear All
> >
> > In May 2001, I asked:
> > I'm working on a PC with Windows95.
> > I have MSWord 2000, Acrobat Reader5, and GSview3.6.
> > Can anyone tell me if it is possible to convert
> > PDF files into ASCII or MSWord?
> > And how....
> >
> > I received many helpful replies, and
> > promised to post a summary, but forgot.
> >
> > A colleague has just asked me about the same problem,
> > which reminded me that I did not post the summary.
> >
> > So here it is. Apologies to anyone I have
> > forgotten.
> >
> > Best
> > Ramesh Krishnamurthy
> > Consultant: COBUILD, Collins Dictionaries.
> > Hon. Res. Fellow: University of Birmingham.
> > Hon. Res. Fellow: University of Wolverhampton.
> >
> >
> > 1. Kevin McTait (UMIST):
> > try the auto-email service at:
> > http://www.pdfzone.com/services/access.html
> >
> > 2. Ha Le An (Wolverhampton Uni):
> > the simplest way is select all, copy from Acrobat Reader, and paste
> > into Word, but there is no way to keep the format, images, tables etc.
> >
> > 3. Fabio Tamburini (Bologna):
> > Open the file with GhostView, then choose menu EDIT, then "Text
> > Extract..." and an ASCII text file will be produced...
> > Pay attention to the formatting of the new file! ;-)
> > I have GSview3.3, but the same feature should also be available in 3.6...
> >
> > 4. Mike Scott (Liverpool):
> > Adobe Acrobat, the full version, not just the Reader,
> > will export to various formats, haven't checked
> > them all yet though.
> >
> > 5. Chris Tribble (Sri Lanka):
> > I do this with the full Acrobat - I use version 4.  This has a text
> > selection tool.  Once you've clicked on this you can use Ctrl-A to select
> > all text in the document if you've selected View, Continuous.  This text
> > can then be pasted into a Notepad or Word document.
> >
> > 6. Acrobat has an export to Postscript option. Then you can use a
> > `postscript-to-text' converter.
> >
> > 7. Everita Milconoka (Latvia):
> > You may try to send your .pdf file to
> > access-b at Adobe.COM
> > and then in the subject line you have to write either pdf2txt or pdf2htm,
> > and after some minutes they will send you back the file in .txt or .htm
> > format.
> >
> > 8. Steven Krauwer (Netherlands):
> > Adobe offers on-line and email facilities for this
> > at http://access.adobe.com:80/simple_form.html
> >
> > 9. Philip Resnik (Maryland):
> > The solution was at
> >  http://www.research.compaq.com/SRC/virtualpaper/pstotext.html --
> > it seems to work very nicely for pdf2txt conversion at least
> > in the Unix version.
> >
> > 10. Simon G. J. Smith (Birmingham):
> > MSWord -- www.adobe.com will do free conversions FROM Word (they get
> > emailed back to you, and you can only do about 5 per email address),
> > but I don't know about the other way round.
> > To extract text from Acrobat (mine is 4.0) choose the text select tool
> > (capital T with a little box). Then just cut and paste the text you
> > want. This works one page at a time.
> > From Ghostview (if it can read your particular PDF; sometimes it doesn't
> > work for me), do the whole thing at once by Edit|Text Extract. It's in
> > the GSview help.
> > You can convert whole pages to bitmaps with GSview, and I think in
> > Acrobat you can select graphics from the PDF file (the Acrobat help says
> > use the graphics select tool, but I can't find this tool). The bitmap
> > file can then be viewed from Word.
> >
> > 14. Jerome Richalot (Lyon)
> > Acrobat 5 apparently makes the whole difference. You can download a
> > plug-in from adobe.com called Access and add it to Acrobat to convert
> > from pdf to rtf.
> >
>
>