[Corpora-List] Tools for batch conversion Word to UTF-8.

Jakub Piskorski jakub.piskorski at frontex.europa.eu
Thu Feb 9 12:26:56 UTC 2012


Maybe this will come in handy as well:
http://poi.apache.org/ 

cheers,

Jakub

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Tino
> Didriksen
> Sent: 09 February 2012 13:08
> To: Josep M. Fontana
> Cc: corpora at uib.no
> Subject: Re: [Corpora-List] Tools for batch conversion Word to UTF-8.
> 
> Modern MS Word .docx files are ZIPs with XML documents, which don't require
> much scripting to extract plain text from.
> 
> Older .doc files will need a trip through some tool. It is possible to use
> OpenOffice/LibreOffice in headless mode for this, and OOo/LO's Office reader gets
> most of the doc format right.
> 
> -- Tino Didriksen
> 
> 
> On Thu, Feb 9, 2012 at 12:38, Josep M. Fontana <josepm.fontana at upf.edu> wrote:
> 
> 
> 	Does anyone here know of a good free application to batch convert Word
> documents to UTF-8? (Linux, OS X or Windows, it doesn't matter)
> 
> 

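For anyone wanting to try the two routes Tino describes, here is a rough Python sketch, not a tested batch tool. The first function reads word/document.xml out of the .docx ZIP and joins the text runs of each paragraph; it ignores headers, footnotes and comments, and text boxes may come out duplicated. The second only builds a headless LibreOffice command line for old .doc files. The function names are made up for illustration, and the sketch assumes that the soffice binary is on your PATH and that the "txt:Text (encoded):UTF8" filter string works with your LibreOffice version.

```python
import zipfile
from xml.etree import ElementTree as ET

# WordprocessingML namespace used in word/document.xml inside a .docx
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W + "t":        # a run of text
                parts.append(node.text or "")
            elif node.tag == W + "tab":    # explicit tab character
                parts.append("\t")
            elif node.tag in (W + "br", W + "cr"):  # line break in a paragraph
                parts.append("\n")
        lines.append("".join(parts))
    return "\n".join(lines)


def soffice_command(doc_path, out_dir):
    """Build (but do not run) a headless LibreOffice conversion command.

    Assumption: `soffice` is on PATH and accepts this export filter string.
    """
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text (encoded):UTF8",
        "--outdir", str(out_dir),
        str(doc_path),
    ]
```

To get UTF-8 files from the first route, write the result with open(name, "w", encoding="utf-8"). For the second, pass the list to subprocess.run(cmd, check=True) once per file in a loop.
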


