[Corpora-List] Tools for batch conversion Word to UTF-8.

Tino Didriksen tino at didriksen.cc
Thu Feb 9 12:07:33 UTC 2012


Modern MS Word .docx files are ZIP archives containing XML documents, so it
takes very little scripting to extract plain text from them.
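
A minimal Python sketch of that idea, using only the standard library. It
assumes the body text sits in word/document.xml, which is the conventional
location but not strictly guaranteed, and it ignores headers, footers,
footnotes and text boxes. The names docx_to_text and convert are made up for
this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main part is stored at the conventional path.
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W + "t":          # a run of literal text
                parts.append(node.text or "")
            elif node.tag == W + "tab":      # tab character
                parts.append("\t")
            elif node.tag in (W + "br", W + "cr"):  # line break inside a paragraph
                parts.append("\n")
        lines.append("".join(parts))
    return "\n".join(lines)


def convert(src, dst):
    """Write the text of src (.docx) to dst, encoded as UTF-8."""
    with open(dst, "w", encoding="utf-8") as out:
        out.write(docx_to_text(src))
```

For a batch run, something like convert(p, p + ".txt") in a loop over the
.docx files in a folder should be enough. Paragraphs nested inside other
paragraphs (text boxes, for instance) would be emitted twice by this sketch.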

Older .doc files will need a trip through some tool. It is possible to use
OpenOffice/LibreOffice in headless mode for this, and OOo/LO's MS Office
import filter gets most of the .doc format right.
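
A rough sketch of driving the headless conversion from Python. It assumes a
LibreOffice build whose soffice binary is on the PATH and supports the
--convert-to and --outdir options, and that the "Text (encoded)" export
filter accepts UTF8 as its option string; check this against your installed
version. The names build_command and convert_folder are made up for this
example.

```python
import subprocess
from pathlib import Path


def build_command(doc_paths, out_dir, soffice="soffice"):
    """Build the argument list for one headless LibreOffice conversion run."""
    return [
        soffice,
        "--headless",
        # Assumption: "Text (encoded)" filter with UTF8 as its encoding option.
        "--convert-to", "txt:Text (encoded):UTF8",
        "--outdir", str(out_dir),
        *[str(p) for p in doc_paths],
    ]


def convert_folder(in_dir, out_dir, soffice="soffice"):
    """Convert every .doc file in in_dir to text files in out_dir."""
    docs = sorted(Path(in_dir).glob("*.doc"))
    if docs:
        # check=True raises CalledProcessError if soffice exits with an error.
        subprocess.run(build_command(docs, out_dir, soffice), check=True)
    return docs
```

Passing all files in one soffice call avoids paying the startup cost per
document. For very large folders the list may need to be split into chunks
to stay under the operating system's command-line length limit.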

-- Tino Didriksen

On Thu, Feb 9, 2012 at 12:38, Josep M. Fontana <josepm.fontana at upf.edu> wrote:

> Does anyone here know of a good free application to batch convert Word
> documents to UTF-8? (Linux, OS X or Windows, it doesn't matter)
>