batch conversion to pdf

Aidan Wilson a.wilson at PGRAD.UNIMELB.EDU.AU
Wed Oct 27 00:20:05 UTC 2010

All true. But it's better than archiving in .doc format. Take the current 
situation with .docx as an example: Microsoft no longer support their own 
proprietary formats (.doc, .ppt, .xls, .mdb, etc.), and to read them in the 
newest Office suite you must download the 'compatibility pack'. The reason is 
of course that other software engineers and manufacturers like Sun Microsystems 
have reverse-engineered these formats and made software that can read and write 
them easily. So Microsoft, understandably, is oriented towards control of 
their formats - an aim that is largely incompatible with those of the 

Adobe, by contrast, have released .pdf as an open standard format, making it 
quite reliable for archiving. To respond to your concerns about indexing and 
searchability, most pdf files (and pdf creation tools, printers, etc.) encode 
character information in a text layer. It's not perfect (try to copy/paste 
the text from a pdf and you'll quickly see why), but it will eventually improve 
to the point where merely printing to pdf encodes a text-only version as a 
sublayer, making the result just as searchable as .doc.
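
To illustrate the copy/paste problem, here is a minimal Python sketch. It 
assumes one common cause, a ligature glyph stored in place of two separate 
letters; the sample string is hypothetical, and real pdf files can fail in 
other ways too.

```python
import unicodedata

# Hypothetical text as it might come out of a pdf: the "fi" in "file" is
# stored as the single ligature character U+FB01, not as "f" followed by "i".
extracted = "archive the \ufb01le"


def normalise(text: str) -> str:
    """Fold compatibility characters (such as ligatures) into plain letters."""
    return unicodedata.normalize("NFKC", text)


def found(needle: str, haystack: str) -> bool:
    """Plain substring search, as a naive indexer might do it."""
    return needle in haystack
```

Here found("file", extracted) fails, while found("file", normalise(extracted)) 
succeeds. Normalisation only helps where the glyph maps back to characters at 
all, which is the heart of the glyph-centric concern raised below.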

Alternatively, you could copy/paste the contents out of a Word doc and archive 
as a raw text file (in addition to pdf). It'd consume negligibly little storage 
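
For the batch case, something along these lines might do as a starting point. 
It is only a sketch: it assumes LibreOffice is installed with its soffice 
program on the PATH (nobody in this thread has named that tool), and the 
function names and the runner parameter are my own, for illustration.

```python
import subprocess
from pathlib import Path


def build_command(src, out_dir, target):
    """Build one headless LibreOffice conversion command (assumed tool)."""
    return ["soffice", "--headless", "--convert-to", target,
            "--outdir", str(out_dir), str(src)]


def batch_convert(in_dir, out_dir, targets=("pdf", "txt:Text"),
                  runner=subprocess.run):
    """Convert every .doc file in in_dir to each target format.

    runner is injectable so the loop can be tried without LibreOffice.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(in_dir.glob("*.doc")):
        for target in targets:
            # check=True raises if the converter exits with an error
            runner(build_command(src, out_dir, target), check=True)
        converted.append(src)
    return converted
```

Calling batch_convert("incoming", "archive") would then write a pdf and a 
plain text copy of each .doc file into the archive directory. The directory 
names are placeholders, and the originals are left untouched.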

Aidan Wilson

PhD Candidate
Dept of Linguistics and Applied Linguistics
The University of Melbourne

+61428 458 969
a.wilson at

On Wed, 27 Oct 2010, Andrew Cunningham wrote:

> I'm just wondering if PDF files are suitable as an archival format,
> since it is in essence a preprint format rather than an archival
> format
> This may be more of a concern with languages written in complex
> scripts (including Latin and Cyrillic script languages that need to be
> treated as complex scripts), where a PDF document will be
> glyph-centric rather than character-centric; affecting searchability,
> indexing and text extraction.
> Andrew
> On 27 October 2010 04:36, Gary Holton <gmholton at> wrote:
>> Here at ANLA we are often faced with the problem of archiving vast
>> numbers of digital files in proprietary formats, especially MS Word.
>> Does anyone know of a good method for batch converting from, say, .doc
>> to .pdf ?

More information about the Resource-network-linguistic-diversity mailing list