Corpora: Converting PDF files
Tolkin, Steve
Steve.Tolkin at FMR.COM
Fri Dec 28 16:43:01 UTC 2001
Oh, I wish it were so easy!
Summary:
I believe there are several problems that affect all the approaches.
1. Ligatures e.g. fi, ff, ffi, Fi, etc. are emitted as special
control characters, e.g. the single character ^L.
2. Words that had a hyphen introduced due to a line ending
are emitted in two pieces.
Details:
1. Just as an example here is the last part of page 7 of
http://www.cs.columbia.edu/~min/papers/cucs-002-01.pdf
that I created by copying with the text tool and then pasting into my
editor (emacs). Note that I have replaced the actual single
characters ^L and ^K by a two character pair so you would see them in
this email. The original file contained a single character ^L (aka
Control-l, C-l, octal 014, hexadecimal 0xc etc.) Note also that ^L is
used for two different purposes: for the ligature fi and to denote a
page break. ^K is used for "ff".
<quote>
The relative di^Kerence between these features across headers within a
document seems to dictate their nesting depth. Header thus computes
its ^Lnal feature set based on the di^Kerences in the values of these
initial features in adjacent headers, shown in Table 3. This
corresponds to learning whether one header dominates, is dominated by,
or is on parity with an adjacent header. These pairwise features are
Header's output and are passed on to the Combiner ^Lnal machine
learning module.
7
^L
</quote>
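Since ^L does double duty here, any cleanup pass has to tell the two uses apart. Below is a minimal sketch in Python. It assumes that a form feed alone on its line is a page break and that any other ^L or ^K is a ligature. The mapping is only the one observed in this one document, and the name expand_ligatures is made up for illustration.

```python
# Mapping observed in the example above: ^K (0x0B) stood for "ff" and
# ^L (0x0C) for "fi". Other documents may use different assignments.
LIGATURES = {"\x0b": "ff", "\x0c": "fi"}


def expand_ligatures(text, ligatures=LIGATURES):
    """Replace ligature control codes, keeping page-break form feeds."""
    out_lines = []
    # split("\n") is used on purpose: str.splitlines() would also break
    # lines at \x0b and \x0c, which are exactly the codes we want to keep.
    for line in text.split("\n"):
        # strip(" \t") rather than strip(): plain strip() treats \x0c
        # itself as whitespace and would remove it.
        if line.strip(" \t") == "\x0c":
            # Assumption: a form feed alone on its line is a page break.
            out_lines.append(line)
            continue
        for code, letters in ligatures.items():
            line = line.replace(code, letters)
        out_lines.append(line)
    return "\n".join(out_lines)
```

Called on the quoted passage, this would turn "^Lnal" into "final" and "di^Kerences" into "differences" while leaving the ^L after the page number alone.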
Unfortunately the approach of having the file read by
Ghostview (and processed by Ghostscript) is even worse.
All the above errors appear, as well as another kind of error where it
cannot read the contents due to some font problem or other issue,
and so uses ### instead, e.g. the last sentence becomes:
<quote>
These pairwise features are ######'s output and are
passed on to the ######## ^Lnal machine learning module.
</quote>
Unfortunately there are many more ligatures than this, e.g. fl,
including some with three letters: ffi, etc. They also
can occur anywhere in a word, e.g. specific became "speci^Lc".
I seem to recall that the particular assignments used by Acrobat,
i.e. which control code is used for which ligature,
vary. (If anyone could provide more information about
this I would appreciate it.)
Assuming you have a big dictionary this problem can be
partially remedied as follows:
Find all words containing a ligature and scan the text
looking for the assignment (i.e. on a per document level).
Then fix them using the inferred mapping.
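The per-document inference could be sketched as follows in Python. This is only an illustration under stated assumptions: the candidate ligature list, the function names and the simple majority vote are mine, and `dictionary` is any set of lowercase words you supply.

```python
import re
from collections import Counter, defaultdict

# Assumed candidate ligatures; extend as needed.
CANDIDATES = ["ff", "fi", "fl", "ffi", "ffl"]

# A token is a run of letters and control codes (tab, LF and CR excluded).
TOKEN = re.compile(r"[A-Za-z\x01-\x08\x0b\x0c\x0e-\x1f]+")


def infer_ligature_map(text, dictionary, candidates=CANDIDATES):
    """Guess, for one document, which control code stands for which ligature."""
    votes = defaultdict(Counter)
    for token in TOKEN.findall(text):
        codes = {ch for ch in token if ch < " "}
        # Skip plain words, tokens mixing several codes, and bare codes
        # (e.g. a form feed used as a page break).
        if len(codes) != 1 or not any(ch.isalpha() for ch in token):
            continue
        code = next(iter(codes))
        for lig in candidates:
            if token.replace(code, lig).lower() in dictionary:
                votes[code][lig] += 1
    # Keep the expansion that produced dictionary words most often.
    return {code: c.most_common(1)[0][0] for code, c in votes.items()}


def apply_ligature_map(text, mapping):
    """Rewrite codes inside words only; bare codes are left untouched."""
    def fix(match):
        token = match.group(0)
        if not any(ch.isalpha() for ch in token):
            return token
        return "".join(mapping.get(ch, ch) for ch in token)
    return TOKEN.sub(fix, text)
```

Voting over the whole document means a single odd word should not decide the mapping, but a code that never appears in a dictionary word stays unresolved, which is why the fix is only partial.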
Aside: This is similar to the problem with ligatures in *.ps files
which the ps2text program tries to fix, e.g. here is an excerpt:
<quote>
#
# Process the filtered PostScript with $ps2txt_cmd and clean up its output.
# Substitute \ddd characters with correct combinations.
#
open(PS2TXT, "$ps2txt_cmd $dviflag < $tmpfile |") || die "Cannot run ps2txt";
while (<PS2TXT>) {
next if (/^\n/o);
chop;
if (/^.*\\.*$/o) {
s/\\214/fi/g;
s/\\256/fi/g;
s/\\257/fl/g;
s/\\320//g;
</quote>
2. When converting an Adobe Acrobat *.pdf file to text
there are often many hyphenated words.
Here is an example from p. 11 of the same document above.
<quote>
To further analyze CLASP's performance,
we assess the features used by Ripper, since it implicitly does feature
selec-
tion when constructing its hypothesis.
</quote>
In certain cases the frequency of hyphenated words is very high.
For example the U.S. IRS presents its publications
using 3 columns, and so there are many hyphenated words introduced.
Assuming you have a big dictionary this problem can be
partially remedied as follows:
If removing the hyphen produces a word, and neither fragment
is itself a word, then we simply store the joined word, e.g.
"ap-propriate" becomes "appropriate".
My coinage for this process: "dehyphenization".
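A minimal Python sketch of that rule, applied to hyphens at line ends. It assumes `dictionary` is a set of lowercase words; the function name and the regular expression are my own illustration, not an existing tool.

```python
import re

# A word fragment, a hyphen at the end of the line, then the next fragment.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-[ \t]*\n[ \t]*([A-Za-z]+)")


def dehyphenate(text, dictionary):
    """Join words split by a line-end hyphen, using the conservative rule:
    the joined form must be a word and neither fragment may be one."""
    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if (joined.lower() in dictionary
                and left.lower() not in dictionary
                and right.lower() not in dictionary):
            return joined
        # Otherwise leave the text exactly as it was.
        return match.group(0)
    return LINE_END_HYPHEN.sub(join, text)
```

With a dictionary containing "selection", the "selec-" / "tion" pair quoted above would be joined, while something like "well-" / "known" is left alone because both fragments are words. Joining drops the line break, so the resulting line is longer than the original.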
Requests for Additional information:
If anyone has tools, e.g. in perl, to perform either of the
fix up workarounds above I would like to know about them.
It may be that these problems can be minimized by
the use of some options when creating the *.pdf file.
If so I would like to learn about that. (But I believe
once the file is created you are stuck.)
Google seems to have a decent *.pdf to *.html convertor
and I would be interested in any information about that.
Hopefully helpfully yours,
Steve
--
Steven Tolkin steve.tolkin at fmr.com 617-563-0516
Fidelity Investments 82 Devonshire St. V1D Boston MA 02109
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.
> -----Original Message-----
> From: ramesh at clg2.bham.ac.uk [mailto:ramesh at clg2.bham.ac.uk]
> Sent: Friday, December 28, 2001 9:55 AM
> To: corpora at hd.uib.no
> Subject: Corpora: Converting PDF files
>
>
>
> Dear All
>
> In May 2001, I asked:
> I'm working on a PC with Windows95.
> I have MSWord 2000, Acrobat Reader5, and GSview3.6.
> Can anyone tell me if it is possible to convert
> PDF files into ASCII or MSWord?
> And how....
>
> I received many helpful replies, and
> promised to post a summary, but forgot.
>
> A colleague has just asked me about the same problem,
> which reminded me that I did not post the summary.
>
> So here it is. Apologies to anyone I have
> forgotten.
>
> Best
> Ramesh Krishnamurthy
> Consultant: COBUILD, Collins Dictionaries.
> Hon. Res. Fellow: University of Birmingham.
> Hon. Res. Fellow: University of Wolverhampton.
>
>
> 1. Kevin McTait (UMIST):
> try the auto-email service at:
> http://www.pdfzone.com/services/access.html
>
> 2. Ha Le An (Wolverhampton Uni):
> the simplest way is select all, copy from Acrobat Reader, and paste into
> word, but there is no way to keep the format, and images, and tables etc.
>
> 3. Fabio Tamburini (Bologna):
> Open the file with GhostView, then choose menu EDIT, then "Text
> Extract..." and an ASCII text file will be produced...
> Pay attention to the formatting of the new file! ;-)
> I have GSview3.3, but such feature should be available also in 3.6...
>
> 4. Mike Scott (Liverpool):
> Adobe Acrobat, the full version, not just the Reader,
> will export to various formats, haven't checked
> them all yet though.
>
> 5. Chris Tribble (Sri Lanka):
> I do this with the full Acrobat - I use version 4. This has a text
> selection tool. Once you've clicked on this you can use Ctrl A to select
> all text in the document if you've selected View, Continuous. This text
> can then be pasted to a notepad or word document.
>
> 6. Acrobat has an export to Postscript option. Then you can use a
> `postscript-to-text' converter.
>
> 7. Everita Milconoka (Latvia):
> You may try to send your .pdf file to
> access-b at Adobe.COM
> and then in subject line you have to write either pdf2txt or pdf2htm,
> and after some minutes they will send you back the file in .txt or .htm
> format.
>
> 8. Steven Krauwer (Netherlands):
> Adobe offers on-line and email facilities for this
> at http://access.adobe.com:80/simple_form.html
>
> 9. Philip Resnik (Maryland):
> The solution was at
> http://www.research.compaq.com/SRC/virtualpaper/pstotext.html --
> it seems to work very nicely for pdf2txt conversion at least
> in the Unix version.
>
> 10. Simon G. J. Smith (Birmingham):
> MSword -- www.adobe.com will do free conversions FROM word (they get
> emailed back to you, and you can only do abt 5 per email address),
> but I don't know about the other way round.
> To extract text from acrobat (mine is 4.0) choose the text select tool
> (capital T with a little box). Then just cut and paste the text you
> want. This works one page at a time.
> From ghostview (if it can read your particular PDF, sometimes doesn't
> work for me), do the whole thing at once by Edit|Text Extract. It's in
> the gsview help.
> You can convert whole pages to bitmaps with gsview, and I think in
> Acrobat you can select graphics from the pdf file (the Acrobat help
> says use the graphics select tool, but I can't find this tool). The
> bitmap file can then be viewed from Word.
>
> 14. Jerome Richalot (Lyon)
> Acrobat 5 apparently makes the whole difference. You can download a
> plug-in from adobe.com called Access and add it on Acrobat to convert
> from pdf to rtf.
>