Arabic-L:GEN:Arabic from PDF responses
Dilworth Parkinson
Dilworth_Parkinson at BYU.EDU
Thu Jun 1 22:56:21 UTC 2006
------------------------------------------------------------------------
Arabic-L: Thu 01 Jun 2006
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic from PDF response
2) Subject:Arabic from PDF response
3) Subject:Arabic from PDF response
4) Subject:Arabic from PDF response
5) Subject:thanks
-------------------------Messages-----------------------------------
1)
Date: 01 Jun 2006
From:setaylor at ma.ultranet.com
Subject:Arabic from PDF response
You wrote on 31 May 2006
If a pdf file isn't protected, I can usually choose the text and copy
the Arabic from it into an application like textedit on the mac and
it works fine. I do it all the time. However, I have been given
some pdf files which have Arabic in them, but when I copy the text
into any other program, it turns to garbage. Is there anyone out
there who can explain this to me?
This isn't an authoritative reply, but here's what I think:
First the facts I can contribute:
PDF (and PostScript) files can carry their fonts packaged with them.
This allows the composer to do things like use unusual encodings.
For example, PS files produced by TeX have unusual encodings for
ligatures: fi is a typical English ligature, which may be represented
in a custom font by a single code. TeX output actually does this.
In days of yore, when disks were smaller, I think that some programs
actually dropped characters which were not used in the document out
of the fonts packaged with it.
And here's my blue-sky fantasy:
The program which prepared your PDF document used a non-standard
encoding, possibly for the very good reason that they wished to have
a lot of ligatures in the document. I'd guess that you could write a
short program to fix the encoding, but you'd probably have to build
the translation table by comparing the visual output to the garbage
grabbed. Worse, if the original program optimized the encoding
according to the document content, you might have to build a separate
table for each document you wanted to get text from.
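
The "short program" idea could look something like the sketch below.
It is only an illustration: the table entries are invented, not taken
from any real font, and the names SAMPLE_TABLE, fix_encoding and
unmapped are hypothetical. It also assumes the copied characters come
out in reading order, which may not hold for every PDF.

```python
# Hypothetical table: keys are the characters that come out of the PDF
# when copied; values are the Arabic text the glyph actually shows.
# These pairs are invented for illustration, not taken from a real font.
SAMPLE_TABLE = {
    "A": "\u0627",        # alef
    "B": "\u0628",        # beh
    "K": "\u0643",        # kaf
    "T": "\u062A",        # teh
    "L": "\u0644\u0627",  # one ligature glyph stands for lam + alef
}


def fix_encoding(garbled: str, table: dict[str, str]) -> str:
    """Replace each copied character using the table; unknown ones pass through."""
    return garbled.translate(str.maketrans(table))


def unmapped(garbled: str, table: dict[str, str]) -> set[str]:
    """Characters not yet in the table, i.e. entries still to be found by eye."""
    return {ch for ch in garbled if ch not in table and not ch.isspace()}
```

Running unmapped() on a new document first shows which codes still
have to be matched against the page by eye, and whether the table
built for one document carries over to the next.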
Stephen Taylor
------------------------------------------------------------------------
2)
Date: 01 Jun 2006
From:ejp10 at psu.edu
Subject:Arabic from PDF response
This is just a guess, but it may depend on which fonts the original
document was using and what the encoding is.
I think PDF files actually embed fonts within them. If the original
document used a Unicode font or some other standard, then when you
copy and paste, you still have Unicode (or whatever).
But if the document is using an older font not matching a standard
Arabic encoding, it's possible that you would have to have the
matching font installed in order to cut and paste.
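
One rough way to tell these two cases apart is to check whether the
pasted text actually falls in the Unicode ranges for Arabic. The sketch
below is only an illustration; the function name arabic_ratio is made
up, and what counts as "mostly Arabic" is left to the reader.

```python
# Unicode ranges that hold Arabic letters and their presentation forms.
ARABIC_RANGES = (
    ("\u0600", "\u06FF"),  # Arabic
    ("\uFB50", "\uFDFF"),  # Arabic Presentation Forms-A
    ("\uFE70", "\uFEFF"),  # Arabic Presentation Forms-B
)


def arabic_ratio(text: str) -> float:
    """Fraction of non-space characters that lie in an Arabic Unicode range."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars
               if any(lo <= ch <= hi for lo, hi in ARABIC_RANGES))
    return hits / len(chars)
```

A ratio near 1 suggests the PDF carried standard Unicode text; a ratio
near 0 for a page that visibly shows Arabic suggests a font with its
own private encoding.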
The other hypothesis is that something went wrong during the PDF
conversion, possibly because the user was using an older tool.
Elizabeth
=-=-=-=-=-=-=-=-=-=-=-=-=
Elizabeth J. Pyatt, Ph.D.
Instructional Designer
Education Technology Services, TLT/ITS
Penn State University
ejp10 at psu.edu, (814) 865-0805 or (814) 865-2030 (Main Office)
------------------------------------------------------------------------
3)
Date: 01 Jun 2006
From:medawar at panix.com
Subject:Arabic from PDF response
Hi Dil,
The PDF is using a nonstandard encoding. This is achieved by embedding
nonstandard Arabic fonts in the PDF.
bassem
------------------------------------------------------------------------
4)
Date: 01 Jun 2006
From:wasamy at umich.edu
Subject:Arabic from PDF response
Are the Arabic PDF files from the same source?
I would look to determine why there is this character encoding
difference.
It may be due to operating system differences. It might also be due to
the original application that the Arabic document was created with.
Waheed
------------------------------------------------------------------------
5)
Date: 01 Jun 2006
From:dil at byu.edu
Subject:thanks
Thanks for the responses. I somehow thought that if it was in pdf it
had a single encoding, but I now realize that that is wrong. It
could have any encoding and still be in pdf.
dil
------------------------------------------------------------------------
End of Arabic-L: 01 Jun 2006