Arabic-L:GEN:Arabic from PDF responses

Dilworth Parkinson Dilworth_Parkinson at BYU.EDU
Thu Jun 1 22:56:21 UTC 2006


------------------------------------------------------------------------
Arabic-L: Thu 01 Jun 2006
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Arabic from PDF response
2) Subject:Arabic from PDF response
3) Subject:Arabic from PDF response
4) Subject:Arabic from PDF response
5) Subject:thanks


-------------------------Messages-----------------------------------
1)
Date: 01 Jun 2006
From:setaylor at ma.ultranet.com
Subject:Arabic from PDF response

You wrote on  31 May 2006

If a pdf file isn't protected, I can usually choose the text and copy
the Arabic from it into an application like textedit on the mac and
it works fine. I do it all the time. However, I have been given
some pdf files which have Arabic in them, but when I copy the text
into any other program, it turns to garbage. Is there anyone out
there who can explain this to me?


This isn't an authoritative reply, but here's what I think:
First the facts I can contribute:
PDF (and PostScript) files can have their fonts packaged with them.
This allows the composer to do things like use unusual encodings.
For example, PostScript files produced by TeX use unusual encodings for
ligatures: fi is a typical English ligature, and a custom font may
represent it with a single code.  TeX output actually does this.
In days of yore, when disks were smaller, I think that some programs  
actually dropped characters which were not used in the document out  
of the fonts packaged with it.

And here's my bluesky fantasy:
The program which prepared your PDF document used a non-standard  
encoding, possibly for the very good reason that they wished to have  
a lot of ligatures in the document.  I'd guess that you could write a  
short program to fix the encoding, but you'd probably have to build  
the translation table by comparing the visual output to the garbage  
grabbed.  Worse, if the original program optimized the encoding  
according to the document content, you might have to build a separate  
table for each document you wanted to get text from.
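
To make the "short program" idea concrete, here is a minimal sketch in Python, not a tested tool. The table entries are invented for illustration; a real table would have to be built by hand, by comparing the page as displayed with the garbage that comes out when copying. It also assumes the copied text arrives in reading order, which may not hold for every PDF.

```python
# Hypothetical translation table: garbage character -> intended Arabic text.
# These pairs are made up for illustration; they do not describe any real font.
GLYPH_TO_UNICODE = {
    "A": "\u0627",        # alef
    "B": "\u0628",        # beh
    "L": "\u0644",        # lam
    "#": "\u0644\u0627",  # a lam-alef ligature stored under a single code
}


def fix_encoding(garbled, table):
    """Replace each mapped character; leave unmapped characters untouched."""
    return garbled.translate(str.maketrans(table))


if __name__ == "__main__":
    # "B#" stands for beh followed by the one-code lam-alef ligature.
    print(fix_encoding("B#", GLYPH_TO_UNICODE))
```

If the original program chose its codes per document, each PDF would need its own copy of GLYPH_TO_UNICODE.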

Stephen Taylor

--------------------------------------------------------------------------
2)
Date: 01 Jun 2006
From:ejp10 at psu.edu
Subject:Arabic from PDF response

This is just a guess, but it may depend on which fonts the original  
document was using and what the encoding is.

I think PDF files actually embed fonts within them. If the original  
document used a Unicode font or some other standard, then when you  
copy and paste, you still have Unicode (or whatever).

But if the document is using an older font not matching a standard  
Arabic encoding, it's possible that you would have to have the  
matching font installed in order to cut and paste.
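
The general effect of an encoding mismatch can be shown in a few lines of Python. This is only an analogy, not a description of how PDF stores text: Windows-1256 (cp1256) stands in here for a standard Arabic encoding, and Latin-1 for whatever the receiving program wrongly assumes. Both choices are assumptions made for the example.

```python
# Illustration only: the same bytes read with two different code tables.
original = "\u0639\u0631\u0628\u064a"  # the Arabic word for "Arabic"

raw_bytes = original.encode("cp1256")   # one byte per Arabic letter
garbled = raw_bytes.decode("latin-1")   # same bytes, wrong table: garbage
recovered = garbled.encode("latin-1").decode("cp1256")  # undo the mistake

print(garbled)
print(recovered == original)
```

The garbage is recoverable here only because both tables are known; with a custom font the table has to be worked out first.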

The other hypothesis is that something went wrong during the PDF  
conversion, possibly because the user was using an older tool.

Elizabeth
=-=-=-=-=-=-=-=-=-=-=-=-=
Elizabeth J. Pyatt, Ph.D.
Instructional Designer
Education Technology Services, TLT/ITS
Penn State University
ejp10 at psu.edu, (814) 865-0805 or (814) 865-2030 (Main Office)


--------------------------------------------------------------------------
3)
Date: 01 Jun 2006
From:medawar at panix.com
Subject:Arabic from PDF response

Hi Dil,

The PDF is using a nonstandard encoding.  This is achieved by embedding
nonstandard Arabic fonts in the PDF.

bassem

--------------------------------------------------------------------------
4)
Date: 01 Jun 2006
From:wasamy at umich.edu
Subject:Arabic from PDF response

Are the Arabic PDF files from the same source?
I would look to determine why there is this character encoding  
difference.
It may be due to operating system differences.  It might also be due to
the original application that the Arabic document was created with.
Waheed

--------------------------------------------------------------------------
5)
Date: 01 Jun 2006
From:dil at byu.edu
Subject:thanks

Thanks for the responses.  I somehow thought that if it was in pdf it  
had a single encoding, but I now realize that that is wrong.  It  
could have any encoding and still be in pdf.
dil


--------------------------------------------------------------------------
End of Arabic-L:  01 Jun 2006


